# DATA 2013 Abstracts

## Area 1 - Business Analytics

Full Papers
Paper Nr: 8
Title:

### Parameterised Fuzzy Petri Nets for Knowledge Representation and Reasoning

Authors:

#### Zbigniew Suraj

Abstract: The paper presents a new methodology for knowledge representation and reasoning based on parameterised fuzzy Petri nets. This net model has recently been proposed as a natural extension of generalised fuzzy Petri nets. The new class extends generalised fuzzy Petri nets by introducing two parameterised families of sums and products, which are intended to provide suitable t-norms and s-norms. The nature of the fuzzy reasoning realised by a given net model varies with the t- and/or s-norms used. However, it is very difficult to select a suitable t- and/or s-norm function for actual applications. Therefore, we propose to use parameterised families of sums and products in the net model, whose nature varies with the values of the parameters. Taking this aspect into account, we can say that parameterised fuzzy Petri nets are more flexible than classical fuzzy Petri nets, because they allow parameterised input/output operators to be defined. Moreover, the choice of suitable operators for a given reasoning process and the speed of the reasoning process are very important, especially in real-time decision support systems. Some advantages of the proposed methodology are shown through its application to train traffic control decision support.
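The idea of a parameterised family of products and sums can be illustrated with a small sketch. The abstract does not name the families used in the paper, so the Hamacher family below is only an assumed example, and the function names are hypothetical. It shows how a single parameter `gamma` changes the behaviour of the t-norm, with the s-norm taken as its De Morgan dual.

```python
def hamacher_t_norm(a: float, b: float, gamma: float) -> float:
    """Hamacher family of t-norms on [0, 1], with gamma >= 0.

    Assumed example family; gamma = 1 gives the algebraic product a * b.
    """
    if a == 0.0 and b == 0.0:
        # Avoids 0/0 when gamma = 0; every t-norm maps (0, 0) to 0.
        return 0.0
    return (a * b) / (gamma + (1.0 - gamma) * (a + b - a * b))


def dual_s_norm(a: float, b: float, gamma: float) -> float:
    """S-norm obtained from the t-norm above by De Morgan duality."""
    return 1.0 - hamacher_t_norm(1.0 - a, 1.0 - b, gamma)


if __name__ == "__main__":
    # The same pair of truth degrees combined under different parameters.
    for gamma in (0.0, 1.0, 2.0):
        print(gamma, hamacher_t_norm(0.6, 0.7, gamma), dual_s_norm(0.6, 0.7, gamma))
```

Varying `gamma` changes how strictly two truth degrees are combined without changing the code that uses the operator, which is the kind of flexibility the abstract attributes to parameterised operators.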

Paper Nr: 29
Title:

### Subspace Clustering with Distance-density Function and Entropy in High-dimensional Data

Authors:

#### Jiwu Zhao and Stefan Conrad

Abstract: Subspace clustering is an extension of traditional clustering that enables finding clusters in subspaces within a data set, which makes it more suitable for detecting clusters in high-dimensional data sets. However, most subspace clustering methods require many complicated parameter settings, which are troublesome to determine, and this limits the applicability of these methods. In our previous work, we developed a subspace clustering method, Automatic Subspace Clustering with Distance-Density function (ASCDD), which computes the density distribution directly in high-dimensional data sets by using just one parameter. In order to facilitate choosing the parameter in ASCDD, we analyze the relation of neighborhood objects and investigate a new way of determining the range of the parameter in this article. Furthermore, we introduce a new method that applies entropy to detect potential subspaces in ASCDD, which evidently reduces the complexity of detecting relevant subspaces.

Paper Nr: 39
Title:

### Enhancing Collaboration in Big Biomedical Data Settings - Knowledge Visualization, Data Mining and Decision Making Issues

Authors:

#### Nikos Karacapilidis, Georgia Tsiliki and Manolis Tzagarakis

Abstract: Biomedical researchers need to efficiently and effectively collaborate and make decisions by meaningfully assembling, mining and analyzing available large-scale volumes of complex multi-faceted data residing in different sources. Arguing that dealing with data-intensive and cognitively complex settings is not a technical problem alone, this paper reports on the development and practical use of an innovative web-based collaboration support service in a biomedical research context. The proposed service builds on the synergy between machine and human intelligence to facilitate and augment the underlying knowledge management, data mining and decision making processes. Evaluation results indicate that the service enables stakeholders to make more informed decisions, by displaying the aggregated information according to their needs.

Paper Nr: 66
Title:

### Rating of Discrimination Networks for Rule-based Systems

Authors:

#### Fabian Ohler, Kai Schwarz, Karl-Heinz Krempels and Christoph Terwelp

Abstract: The amount of information stored in digital form grows on a daily basis but is mostly only understandable by humans, not machines. A way to enable machines to understand this information is to use a representation suitable for further processing, e.g. frames for fact declaration in a Rule-based System. Rule-based Systems rely heavily on Discrimination Networks to store intermediate results and so speed up the rule processing cycles. As these Discrimination Networks have a very complex structure, it is important to be able to optimize them or to choose one out of many Discrimination Networks based on its structural efficiency. Therefore, we present a rating mechanism for Discrimination Network structures and their efficiencies. The ratings are based on a normalised representation of Discrimination Network structures and on change frequency estimations of the facts in the working memory, and are used to compare different Discrimination Networks with regard to processing costs.

Short Papers
Paper Nr: 15
Title:

### Advanced Analytics with the SAP HANA Database

Authors:

#### Philipp Große, Wolfgang Lehner and Norman May

Abstract: Complex database applications require complex custom logic to be executed in the database kernel. Traditional relational databases lack an easy-to-use programming model to implement and tune such user-defined code, which motivates developers to use MapReduce instead of traditional database systems. In this paper we discuss four processing patterns in the context of the distributed SAP HANA database that even go beyond the classic MapReduce paradigm. We illustrate them using some typical Machine Learning algorithms and present experimental results that demonstrate how the data flows scale out with the number of parallel tasks.

Paper Nr: 26
Title:

### Predicting Cases of Ambulatory Care Sensitive Conditions

Authors:

#### W. Haque and D. C. Finke

Abstract: Proper management of ambulatory care sensitive conditions (ACSC) not only enhances patient care but also reduces healthcare costs by minimizing hospitalizations. In order to allocate resources strategically, it is essential to rely on informed forecasting decisions. Among other factors, healthcare data is deeply affected by seasonality, granularity, missing information and sheer volume. We have used the ten-year history from a Discharge Abstract Database to build predictive models and perform multi-dimensional analysis on key metrics such as age, gender, and demographics. The resulting insights suggest that investments in some areas appear to be working and should continue, whereas other areas suggest a need for reallocation of resources. The results have been confirmed using two distinct time series models. The forecasted data is integrated with existing data and presented to users through data visualization tools with capabilities to drill down to reports of finer granularity. It is observed that though some diagnoses appear to be on an upward trend in prevalence over the next few years, other ACSC-related diagnoses will continue to occur with either the same or slightly lower frequency.

Paper Nr: 33
Title:

### Consistency of Incomplete Data

Authors:

#### Patrick G. Clark and Jerzy Grzymala-Busse

Abstract: In this paper we introduce an idea of consistency for incomplete data sets. Consistency is well-known for completely specified data sets, where a data set is consistent if any two cases with all attribute values equal belong to the same concept. We generalize the definition of consistency for incomplete data sets using rough set theory. For incomplete data sets there exist three definitions of consistency. We discuss two types of missing attribute values: lost values and "do not care" conditions. We illustrate the idea of consistency for incomplete data sets using experiments on many data sets with missing attribute values derived from five benchmark data sets. The results of our paper may be applied to increase the efficiency of mining incomplete data.
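The classical definition quoted in the abstract, for completely specified data sets only, can be sketched as follows. The representation of a case as a pair of attribute values and concept, and the function name, are assumptions made for illustration. The three definitions for incomplete data sets are not given in the abstract and are therefore not sketched here.

```python
def is_consistent(cases):
    """Check consistency of a completely specified data set.

    `cases` is an iterable of (attribute_values, concept) pairs, where
    attribute_values is a sequence of hashable values. The data set is
    consistent if cases with identical attribute values never belong to
    different concepts.
    """
    concept_by_attributes = {}
    for attribute_values, concept in cases:
        key = tuple(attribute_values)
        # setdefault records the first concept seen for this attribute
        # vector and returns it on every later occurrence.
        if concept_by_attributes.setdefault(key, concept) != concept:
            return False
    return True


if __name__ == "__main__":
    data = [
        (("high", "yes"), "flu"),
        (("high", "yes"), "flu"),
        (("normal", "no"), "healthy"),
    ]
    print(is_consistent(data))
    # Adding a case with the same attributes but another concept
    # makes the data set inconsistent.
    print(is_consistent(data + [(("high", "yes"), "healthy")]))
```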

Paper Nr: 59
Title:

### R-Pref: Rapid Prototyping of Database Preference Queries in R

Authors:

#### Patrick Roocks and Werner Kießling

Abstract: Preferences are a well-established framework for database queries with soft constraints. Such queries select the best objects from large data sets according to a strict partial order induced by intuitive and semantically rich preference constructors. Together with functionality like grouping and aggregation, adapted from well-known database mechanisms, a very flexible preference framework has emerged in the last decade. In this paper we present R-Pref, an implementation of the preference framework in the statistical computing language R. R-Pref comprises less than 1000 lines of code and adheres to the formal foundations of preferences. It allows rapid prototyping of new preferences and related concepts. Exemplarily we present a use case in which a simple text mining example based on pattern matching is enriched by preferences. We argue that R-Pref paves the way for rapidly exploring new fields of application for preferences. Especially new semantic constructs for preference related operations together with equivalences of preference terms, being highly important for optimization, can be quickly evaluated.

Paper Nr: 67
Title:

### A New Addressing Scheme for Discrimination Networks easing Development and Testing

Authors:

#### Karl-Heinz Krempels, Fabian Ohler and Christoph Terwelp

Abstract: Rule Based Systems and Database Management Systems are important tools for data storage and processing. Discrimination Networks (DNs) are an efficient way of matching conditions on data. DNs are based on the paradigm of dynamic programming and save intermediate computing results in network nodes. Therefore, an efficient scheme for addressing the data in the memories used is required. The schemes currently in use are efficient but complicated to operate, which hinders the development of new approaches for structural and functional optimization of DNs. We introduce and discuss a new addressing scheme for fact referencing in DNs with the aim of easing the development of optimization approaches for DNs. The scheme uses fact addresses computed from sets of edges between the nodes in a DN to reference data.

Posters
Paper Nr: 3
Title:

### A MapReduce Architecture for Web Site User Behaviour Monitoring in Real Time

Authors:

#### Bill Karakostas and Babis Theodoulidis

Abstract: Monitoring the behaviour of large numbers of web site users in real time poses significant performance challenges, due to the decentralised location and volume of generated data. This paper proposes a MapReduce-style architecture where the processing of event series from the Web users is performed by a number of cascading mappers, reducers and rereducers, local to the event origin. With the use of static analysis and a prototype implementation, we show how this architecture is capable of carrying out time series analysis in real time for very large web data sets, based on the actual events, instead of resorting to sampling or other extrapolation techniques.

Paper Nr: 12
Title:

### Enhancing News Articles Clustering using Word N-Grams

Authors:

#### Christos Bouras and Vassilis Tsogkas

Abstract: In this work we explore the possible enhancement of document clustering results, and in particular the clustering of news articles from the web, when using word-based n-grams during the keyword extraction phase. We present and evaluate a weighting approach that combines clustering of news articles derived from the web using n-grams, extracted from the articles at an offline stage. We compared this technique with the single-minded bag-of-words representation that our clustering algorithm, W-kmeans, previously used. Our experimentation revealed that by tuning the weighting parameters between keywords and n-grams, as well as n itself, a significant improvement in the clustering result metrics can be achieved. This reflects more coherent clusters and better overall clustering performance.
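As a point of reference for the term, word-based n-grams can be extracted as in the minimal sketch below. The whitespace tokenisation and the function name are assumptions; the sketch does not reproduce the paper's keyword extraction, its weighting approach, or W-kmeans.

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """Return the word n-grams of `text` as space-joined strings.

    Tokenisation is a plain lowercase whitespace split, an assumption
    made only to keep the example self-contained.
    """
    tokens = text.lower().split()
    # A sliding window of width n over the token list; the result is
    # empty when the text has fewer than n tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(word_ngrams("Stock markets fall sharply", 2))
```

Unlike a bag of single words, the bigrams keep local word order, so "stock markets" is treated as one unit rather than two unrelated keywords.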

Paper Nr: 35
Title:

### Database Functionalities for Evolving Monitoring Applications

Authors:

#### Philip Schmiegelt, Jingquan Xie, Gereon Schüller and Andreas Behrend

Abstract: Databases are able to store, manage, and retrieve large amounts and a broad variety of data. However, the task of understanding and reacting to the data is often left to tools or user applications outside the database. As a consequence, monitoring applications often rely on problem-specific imperative code for data analysis, scattering the application logic. This usually leads to island solutions which are hard to maintain and give rise to security and performance problems due to the separation of data storage and analysis. In this paper, we identify missing database functionalities which overcome these problems by allowing data processing on a higher level of abstraction. Such functionalities would allow a database system to be employed even for the complex analysis tasks required in evolving monitoring scenarios.

Paper Nr: 38
Title:

### Effective Business Plan Evaluation using an Evolutionary Ensemble

Authors:

#### G. Dounias, A. Tsakonas, D. Charalampakis and E. Vasilakis

Abstract: The paper proposes the use of evolving intelligent techniques for effective business decision making related to strategic management. In the current competitive environment, business plan appraisal arises as an important task for bankers, investors, venture capital fund managers and consultants, among others. The process of business plan assessment requires various technical competencies, market awareness and adequate experience, thus increasing the relevant operating costs. A conceptual model for the evaluation of business plans is proposed, using both numerical and qualitative parameters clustered under four headings. The input data is processed with the comparative use of ensembles of evolutionary classifiers, and an intelligent model of business plan appraisal is built. The reliability and the accuracy of the results are considered satisfactory by the subject matter experts.

Paper Nr: 63
Title:

### Estimate the Market Share from the Search Engine Hit Counts

Authors:

#### Robert Viseur

Abstract: The knowledge of the competitive environment (and, in particular, market share) is an important factor in the management of innovation. This type of information is not always accessible to small and medium enterprises. In addition, some sectors are changing rapidly under the pressure of technological change. We propose in this research a method for estimating the market share based on media share, based on the hit counts returned by search engines for each brand. We show the potential of this approach with a real example (the automotive industry) and discuss the limitations associated with the operating mode of search engines.

## Area 2 - Data Management and Quality

Full Papers
Paper Nr: 14
Title:

### A Generic and Flexible Framework for Selecting Correspondences in Matching and Alignment Problems

Authors:

#### Fabien Duchateau

Abstract: The Web 2.0 and the low cost of storage have driven an exponential growth in the volume of collected and produced data. However, the integration of distributed and heterogeneous data sources has become the bottleneck for many applications, and it still largely relies on manual tasks. One of these tasks, named matching or alignment, is the discovery of correspondences, i.e., semantically-equivalent elements in different data sources. Most approaches which attempt to solve this challenge face the issue of deciding whether a pair of elements is a correspondence or not, given the similarity value(s) computed for this pair. In this paper, we propose a generic and flexible framework for selecting the correspondences by relying on the discriminative similarity values for a pair. Experiments on a public dataset have demonstrated the improvement in terms of quality, and the robustness when adding new similarity measures without user intervention for tuning.

Paper Nr: 36
Title:

### Automatic Synthesis of Data Cleansing Activities

Authors:

#### Mario Mezzanzanica, Roberto Boselli, Mirko Cesarini and Fabio Mercorio

Abstract: Data cleansing is growing in importance among both public and private organisations, mainly due to the considerable amount of data exploited for supporting decision making processes. This paper aims to show how model-based verification algorithms (namely, model checking) can contribute to addressing data cleansing issues. Furthermore, a new benchmark problem focusing on labour market dynamics is introduced. The consistent evolution of the data is checked using a model defined on the basis of domain knowledge. Then, we formally introduce the concept of a universal cleanser, i.e. an object which summarises the set of all cleansing actions for each feasible data inconsistency (according to a given consistency model), and provide an algorithm which synthesises it. The universal cleanser can be seen as a repository of corrective interventions useful for developing cleansing routines. We applied our approach to a dataset derived from Italian labour market data, making the whole dataset and outcomes publicly available to the community, so that the results we present can be shared and compared with other techniques.

Short Papers
Paper Nr: 11
Title:

### A Clustering Topology for Wireless Sensor Networks - New Semantics over Network Topology

Authors:

#### Paul Cotofrei, Ionel Tudor Calistru and Kilian Stoffel

Abstract: Sensor networks are a primary source of massive amounts of data about the real world that surrounds us, measuring a wide range of physical parameters in real time. Given the hardware limitations and physical environment in which the sensors must operate, along with frequent changes of network topology, algorithms and protocols must be designed to provide a robust and energy-efficient communications mechanism. With a view to addressing these constraints, this paper proposes a routing technique that is based on the density-based spatial clustering of applications with noise (DBSCAN) algorithm. This technique reveals several network topology semantics, enables the splitting of sensor responsibilities (communication/routing and sensing/monitoring), reduces the energy wasted on sending messages through the network by aggregating data only in cluster-head nodes and, last but not least, gives very good results in prolonging the network lifetime.

Paper Nr: 28
Title:

### An Elastic Cache Infrastructure through Multi-level Load-balancing

Authors:

#### Carlos Lübbe and Bernhard Mitschang

Abstract: An increasing demand for geographic data compels data providers to handle an enormous amount of range queries at their data tier. In answer to this, frequently used data can be cached in a distributed main memory store in which the load is balanced among multiple cache nodes. To make appropriate load-balancing decisions, several key-indicators such as expected and actual workload as well as data skew can be used. In this work, we make use of an abstract mathematical model to consolidate these indicators. Moreover, we propose a multi-level load-balancing algorithm which considers the different indicators in separate stages. Our evaluation shows that our multi-level approach significantly improves the resource utilization in comparison to existing technology.

Paper Nr: 54
Title:

### Approaching ETL Conceptual Modelling and Validation using BPMN and BPEL

Authors:

#### Bruno Oliveira and Orlando Belo

Abstract: Data warehousing systems reached maturity long ago in the area of decision support systems. From dimensional modelling to query optimization, many topics in the field have already been systematically studied and explored. However, ETL (Extract-Transform-Load) still suffers from the lack of a simple and rigorous approach for modelling and validating the processes that populate data warehouses. In spite of significant efforts by researchers in that area, there is not yet a convincing and simple approach for modelling (conceptual and logical views), validating and testing an ETL process (or a group of processes) before taking it to implementation and roll-out. In this paper we explore the use of BPMN for ETL conceptual modelling, and develop the bridges to translate conceptual models to BPEL and, later, to BPMN 2.0, in order to test and validate the correctness and effectiveness of the designed ETL processes, based on those two approaches for process model execution. We intend to provide a set of BPMN meta-models especially designed to map standard data warehousing ETL processes.

Paper Nr: 58
Title:

### Citable by Design - A Model for Making Data in Dynamic Environments Citable

Authors:

#### Stefan Pröll and Andreas Rauber

Abstract: Data forms the basis for research publications. However, researchers still focus on the paper-based publication; data is seen rather as a supplement that could be offered as a download, often without further comment. Yet validation, verification, reproduction and re-usage of existing knowledge can only be applied when the research data is accessible and identifiable. For this reason, precise data citation mechanisms are required that allow experiments to be reproduced with exactly the same data basis. In this paper, we propose a model that makes it possible to cite, identify and reference specific data sets within their dynamic environments. Our model allows the selection of subsets that support experiment verification and result re-utilisation in different contexts. The approach is based on assigning persistent identifiers to timestamped queries which are executed against timestamped and versioned databases. This facilitates transparent implementation and scalable means to ensure that identical result sets are delivered upon re-invocation of the query.

Posters
Paper Nr: 20
Title:

### Data Management for M2M Communication using Telecom Mediation Systems

Authors:

#### Sandeep Akhouri and Kirti Girdhar

Abstract: Mediation systems are an integral part of a telecom carrier’s Business Support System / Operational Support System (BSS / OSS) landscape. They are responsible for collecting, transforming and consolidating massive volumes of data from a diverse set of Network Elements (NE) across a range of protocols. Traditionally, mediation systems support revenue management functions such as charging and billing. Unlike standard data warehousing applications that are based on the Extract-Transform-Load (ETL) paradigm, mediation systems are focused on rapid processing of events in real-time or near real-time. This paper proposes the use of a mediation system as a data management platform for Machine-to-Machine (M2M) communication. Mediation systems inherently deliver scalable solutions for handling huge volumes of sensor data. They can connect across a diverse set of M2M communication protocols to collect, consolidate and process sensor data and feed it downstream to decision support systems for actionable intelligence. This paper briefly explores the integration of mediation systems with stream mining applications to classify and cluster data streams.

Paper Nr: 22
Title:

### Data Quality Evaluation of Scientific Datasets - A Case Study in a Policy Support Context

Authors:

#### Antonella Zanzi and Alberto Trombetta

Abstract: In this work we present the rule-based approach used to evaluate the quality of scientific datasets in a policy support context. The case study refers to real datasets in a context where low data quality limits the accuracy of the analysis results and, consequently, the significance of the provided policy advice. The applied solution consists of identifying types of constraints that can be useful as data quality rules and developing a software tool to evaluate a dataset on the basis of a set of rules expressed in XML. As rule types we selected some types of data constraints and dependencies already proposed in data quality works, but we also experimented with the use of order dependencies and existence constraints. The case study was used to develop and test the adopted solution, which is nevertheless generally applicable to other contexts.

Paper Nr: 25
Title:

### Social Data Sentiment Analysis in Smart Environments - Extending Dual Polarities for Crowd Pulse Capturing

Authors:

#### Athena Vakali, Despoina Chatzakou, Vassiliki Koutsonikola and Georgios Andreadis

Abstract: Social networks drive today's opinion and content diffusion. Humans interact in social media on the basis of their emotional states, and it is important to capture people's emotional scales for a particular theme. Such interactions are facilitated and become evident in smart environments characterized by mobile devices and new smart city contexts. This work proposes a sentiment analysis approach which extends positive and negative polarity to higher and wider emotional scales in order to offer new smart services over mobile devices. A particular methodology and a generic framework are outlined, along with indicative mobile applications which employ microblogging data analysis for chosen topics, locations and times. These applications capture crowd pulse as expressed in microblogging platforms, and such an analysis is beneficial for various communities such as policy makers, authorities and the public.

Paper Nr: 57
Title:

### Towards a Second Generation of Computer Interpretable Guidelines

Authors:

#### Paolo Terenziani, Alessio Bottrighi, Laura Giordano, Giuliana Franceschinis, Stefania Montani, Luigi Portinale and Daniele Theseider Dupre

Abstract: Computer Interpretable Guidelines (CIG) are an emerging area of research, to support medical decision making through evidence-based recommendations. However, new challenges in the data management field have to be faced, to integrate CIG management with a proper treatment of patient data, and of other forms of medical knowledge (e.g., causal and behavioral knowledge). In this position paper, we summarize a proposal for a research agenda that, in our opinion, can lead to a significant advancement in the field. The goal of the work is to provide suitable models and reasoning methodologies to cope with the aforementioned aspects, and to properly integrate them for medical decision support. Achieving such a goal requires advances in data management, and, in particular, in the treatment of indeterminate valid-time data in relational databases, of temporal abstraction on time series, of case retrieval on time series, of design-time and run-time model-based verification of guidelines, of case-based reasoning, of non-monotonic logics, of formal ontologies, of probabilistic graphical models (Bayesian Networks and Influence Diagrams).

Paper Nr: 60
Title:

### Data Curation Framework for Facilities Science

Authors:

#### Vasily Bunakov and Brian Matthews

Abstract: The trend in research data management practice is that the role of large facilities represented by particle accelerators, neutron sources and other scientific instruments of scale extends beyond providing capabilities for the raw data collection and its initial processing. Managing data and publications catalogues, shared software repositories and sophisticated data archives have become common responsibilities of the research facilities. We suggest that facilities can further move from managing data to curating them which implies meaningful data enrichment, annotation and linkage according to the best practices which have emerged in the facilities science itself or have been borrowed elsewhere. We discuss the challenges and opportunities that are the drivers for this role transformation, and suggest a data curation framework harmonized with the research lifecycle in facilities science.

## Area 3 - Ontologies and the Semantic Web

Short Papers
Paper Nr: 19
Title:

### EDEX: Entity Preserving Data Exchange

Authors:

#### Yoones A. Sekhavat and Jeffrey Parsons

Abstract: Data Exchange creates an instance of a target schema from an instance of a source schema such that source data is reflected in the target instance. The prevailing approach for data exchange is based on generating and using schema mapping expressions representing high-level relations between source and target. We show that such class-level schema mappings cannot resolve some ambiguous cases. We propose an Entity Preserving Data Exchange (EDEX) method that reflects source entities in the target independent of the classification of entities. We show that EDEX can reconcile such ambiguities while generating a core solution as an efficient solution.

Paper Nr: 40
Title:

### Semantic Copyright Management of Media Fragments

Authors:

#### Roberto García, David Castellà and Rosa Gil

Abstract: The amount of media on the Web poses many scalability issues, among them copyright management. This problem becomes even bigger when not just the copyright of pieces of content has to be considered, but also that of media fragments. Fragments and the management of their rights, beyond simple access control, are the centrepiece for media reuse. This can become an enormous market where copyright has to be managed through the whole value chain. To attain the required level of scalability, it is necessary to provide highly expressive rights representations that can be connected to media fragments. Ontologies provide enough expressive power and facilitate the implementation of copyright management solutions that can scale in such a scenario. The proposed Copyright Ontology is based on Semantic Web technologies, which facilitate implementations at Web scale, and it can reuse existing recommendations for media fragment identifiers and interoperate with existing standards. To illustrate these benefits, the paper presents a use case where the ontology is used to enable copyright reasoning on top of DDEX data, the industry standard for information exchange along media value chains.

Posters
Paper Nr: 43
Title:

### Designing a Farmer Centred Ontology for Social Life Network

Authors:

#### Anusha Indika Walisadeera, Gihan Nilendra Wikramanayake and Athula Ginige

Abstract: Rapid adoption of mobile phones has vastly improved access to information. Yet finding information in a timely manner, within the context in which it is required, is a challenge. To investigate some of the underlying farmer-centric research challenges, a large International Collaborative Research Project to develop mobile-based information systems for people in developing countries has been launched. One major sub-project is to develop a Social Life Network, a mobile-based information system for farmers in Sri Lanka. Lack of timely information that matches their preferences and needs and supports their farming activities is creating many problems for farmers in Sri Lanka. For instance, farmers need agricultural information within the context of the location of their farm land, their economic condition, their interests and beliefs, and the available agricultural equipment. As a part of this project we investigated how we can create a knowledge repository of agricultural information to respond to user queries, taking into account the context in which the information is needed. Because of the complex nature of the relationships among various concepts, we selected an ontological approach that supports first order logic to create the knowledge repository. We first identified a set of questions that reflect various motivation scenarios. Next we created a model to represent user context. Then we developed a novel approach to derive the competency questions incorporating user context. These competency questions were used to identify the concepts, relationships and axioms to develop the ontology. The initial system was trialled with a group of farmers in Sri Lanka. All farmers who participated in the field trial agreed, to varying degrees (strongly agree, agree, moderately agree), with the statement “All information for the crop selection stage is provided”.

Paper Nr: 64
Title:

### Extraction of Biographical Data from Wikipedia

Authors:

#### Robert Viseur

Abstract: Using the content of Wikipedia articles is common in academic research. However, the practicalities are rarely analysed. Our research focuses on extracting biographical information about personalities from Belgium and is divided into three sections. The first section describes the state of the art for data extraction from Wikipedia. The second section presents the case study on data extraction for biographies of Belgian personalities: different solutions are discussed and the adopted solution is implemented. The third section discusses the quality of the extraction. Practical recommendations for researchers wishing to use Wikipedia are also proposed on the basis of our case study.

## Area 4 - Databases and Data Security

Full Papers
Paper Nr: 18
Title:

Authors:

Abstract: .

Short Papers
Paper Nr: 16
Title:

### Design and Evaluation of a Graph Codec System for Software Watermarking

Authors:

#### Maria Chroni and Stavros D. Nikolopoulos

Abstract: In this paper, we propose an efficient and easily implemented codec system for encoding watermark numbers as graph structures through the use of self-inverting permutations. More precisely, based on the fact that a watermark number $w$ can be efficiently encoded as a self-inverting permutation $\pi^*$, we present an efficient encoding algorithm which encodes a self-inverting permutation $\pi^*$ as a reducible flow-graph $F[\pi^*]$ and a decoding algorithm which extracts the permutation $\pi^*$ from the graph $F[\pi^*]$. Our codec algorithms are very simple, use elementary operations on sequences and linked structures, and the produced flow-graph $F[\pi^*]$ does not differ from the graph data structures built by real programs. Moreover, our codec algorithms have very low time and space complexity, and the flow-graph $F[\pi^*]$ incorporates important structural properties which make it resilient to attacks. We have evaluated several components of our codec system in a simulation environment in order to obtain a clear view of their practical behaviour; the experimental results show that we can decide with high probability whether the graph $F[\pi^*]$ has suffered an attack on its edges.
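
For readers unfamiliar with the term, a self-inverting permutation is one that equals its own inverse, so applying it twice returns every element to its original position. The following minimal Python sketch only checks that property. The function name `is_self_inverting` and the sample inputs are illustrative assumptions; the sketch does not reproduce the authors' encoding of $w$ as $\pi^*$ or their construction of $F[\pi^*]$.

```python
def is_self_inverting(perm):
    """Return True if perm is a permutation of 1..n that equals its own inverse.

    perm is a list holding the 1-based images pi(1), ..., pi(n).
    """
    n = len(perm)
    # First make sure the input really is a permutation of 1..n.
    if sorted(perm) != list(range(1, n + 1)):
        return False
    # Self-inverting means pi(pi(i)) == i for every i in 1..n.
    return all(perm[perm[i] - 1] == i + 1 for i in range(n))


if __name__ == "__main__":
    print(is_self_inverting([2, 1, 4, 3]))  # two 2-cycles
    print(is_self_inverting([2, 3, 1]))     # a 3-cycle
```

A permutation passes this check exactly when its cycle decomposition contains only fixed points and 2-cycles.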

Paper Nr: 32
Title:

### Highly Scalable Sort-merge Join Algorithm for RDF Querying

Authors:

#### Zbyněk Falt, Miroslav Čermák and Filip Zavoral

Abstract: In this paper, we introduce a highly scalable sort-merge join algorithm for RDF databases. The algorithm is designed especially for streaming systems; besides task and data parallelism, it also tries to exploit pipeline parallelism in order to increase its scalability. Additionally, we focused on handling skewed data correctly and efficiently; the algorithm scales well regardless of the data distribution.
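
As background, the classic single-threaded sort-merge join that such work builds on can be sketched as follows. This is a textbook version under stated assumptions, not the authors' parallel streaming algorithm: the function name `sort_merge_join` and the `(key, value)` tuple layout are hypothetical, and no task, data, or pipeline parallelism is modelled.

```python
def sort_merge_join(left, right):
    """Equi-join two lists of (key, value) pairs on key.

    Returns a list of (key, left_value, right_value) tuples.
    """
    # Sort phase: order both inputs by join key.
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])

    out = []
    i = j = 0
    # Merge phase: advance whichever side has the smaller key.
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

The cross product emitted for each run of equal keys is where skew matters: a key that is frequent on both sides produces a large block of output from a single pair of runs, so the work per key can be very uneven.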

Posters
Paper Nr: 10
Title:

### Frame Time and Cardinality Indeterminacy in Temporal Relational Databases

Authors:

#### Paolo Terenziani

Abstract: Time pervades reality, and many relational database approaches have been developed to cope with it. However, in practical applications, temporal indeterminacy about the exact time of occurrence of facts and, possibly, about the number of occurrences, may arise. Coping with such phenomena requires an in-depth extension of current techniques. In this paper, we introduce a new data model and new definitions of relational algebraic operators that cope with the above issues, and we study operator reducibility.

Paper Nr: 21
Title:

### SylvaDB: A Polyglot and Multi-backend Graph Database Management System

Authors:

#### Javier de la Rosa, Juan Luis Suárez and Fernando Sancho Caparrini

Abstract: This paper presents SylvaDB, a graph database management system designed to be used by people with no technical knowledge. SylvaDB is based on flexible schema definitions and has been developed taking into account the need to deal with semantic information. It relies on the mathematical notion of a property graph. SylvaDB is an open source project and aims at lowering the barrier to adoption for anyone using graph databases. At the same time, it is robust and scalable enough to support large collaborative projects related to knowledge management, document archiving, and research.
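
To illustrate the general notion of a property graph mentioned above (labelled nodes and directed, labelled edges, each carrying key-value properties), here is a minimal in-memory sketch in Python. All class, method, and field names are hypothetical and chosen for illustration only; this is not SylvaDB's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # directed, labelled edges

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = Node(node_id, label, props)

    def add_edge(self, source, target, label, **props):
        # Both endpoints must already exist in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.append(Edge(source, target, label, props))

    def neighbours(self, node_id, label=None):
        # Targets of outgoing edges, optionally filtered by edge label.
        return [e.target for e in self.edges
                if e.source == node_id and (label is None or e.label == label)]


if __name__ == "__main__":
    g = PropertyGraph()
    g.add_node(1, "Person", name="Ada")
    g.add_node(2, "Document", title="Notes")
    g.add_edge(1, 2, "AUTHORED", year=2013)
    print(g.neighbours(1, "AUTHORED"))
```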