DATA 2018 Abstracts


Area 1 - Big Data

Full Papers
Paper Nr: 17
Title:

Detecting and Analyzing Privacy Leaks in Tweets

Authors:

Paolo Cappellari, Soon Chun and Christopher Costello

Abstract: Social network platforms are changing the way people interact not just with each other but also with companies and institutions. In sharing information on these platforms, users often underestimate the potential consequences, especially when such information discloses personal details. For this reason, actionable privacy awareness and protection mechanisms are becoming of paramount importance. In this paper we propose an approach to assess the privacy content of social posts with the twofold goal of protecting users from inadvertently disclosing sensitive information and raising awareness about privacy in online behavior. We adopt a machine learning approach, based on a crowd-sourced definition of privacy, that can assess whether messages disclose sensitive information. Our approach automatically detects messages carrying sensitive information, so as to warn users before a post is shared, and provides a set of analyses to raise users' awareness of privacy disclosure in their online behavior.
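
A minimal sketch of the core idea, not the authors' model: a classifier trained on crowd-sourced sensitivity labels flags a post before it is shared. The toy training posts, the labels, and the Naive Bayes choice below are all illustrative assumptions.

```python
# Hypothetical sketch: warn a user when a post looks "sensitive",
# based on a tiny invented set of crowd-labeled examples.
from collections import Counter
import math

TRAIN = [
    ("my home address is 12 oak street", "sensitive"),
    ("just got my new credit card number", "sensitive"),
    ("flying out tomorrow leaving the house empty", "sensitive"),
    ("great game last night", "safe"),
    ("this coffee shop has the best espresso", "safe"),
    ("loving the new album", "safe"),
]

def train(examples):
    word_counts = {"sensitive": Counter(), "safe": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    vocab = {w for c in word_counts.values() for w in c}
    best, best_lp = None, float("-inf")
    for label, cc in class_counts.items():
        lp = math.log(cc / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing so unseen words do not zero the probability
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

wc, cc = train(TRAIN)
print(classify("posting my credit card number", wc, cc))  # sensitive
```

In the paper's setting, the warning would be surfaced in the posting interface before the message is published.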

Paper Nr: 27
Title:

Clustering Big Data

Authors:

Michele Ianni, Elio Masciari, Giuseppe M. Mazzeo and Carlo Zaniolo

Abstract: The need to support advanced analytics on Big Data is driving data scientists' interest toward massively parallel distributed systems and software platforms, such as Map-Reduce and Spark, that make their scalable utilization possible. However, when complex data mining algorithms are required, their fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed as sequential procedures must often be redesigned in order to use the distributed computational resources effectively. In this paper, we explore these problems and then propose a solution which has proven to be very effective on the complex hierarchical clustering algorithm CLUBS+. By using four stages of successive refinements, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+ on Map-Reduce platforms.

Short Papers
Paper Nr: 13
Title:

Anomaly Detection for Industrial Big Data

Authors:

Neil Caithness and David Wallom

Abstract: As the Industrial Internet of Things (IIoT) grows, systems are increasingly being monitored by arrays of sensors returning time-series data at ever-increasing ‘volume, velocity and variety’ (i.e., Industrial Big Data). An obvious use for these data is real-time system condition monitoring and prognostic time-to-failure analysis (remaining useful life, RUL); see, e.g., the white papers by Senseye.io, Prognostics - The Future of Condition Monitoring, and the output of the NASA Prognostics Center of Excellence (PCoE). However, as noted by others, our ability to collect “big data” has greatly surpassed our capability to analyze it. In order to fully utilize the potential of Industrial Big Data we need data-driven techniques that operate at scales that process models cannot. Here we present a prototype technique for data-driven anomaly detection that operates at industrial scale. The method generalizes to almost any multivariate data set; it is based on independent ordinations of repeated (bootstrapped) partitions of the data set and inspection of the joint distribution of ordinal distances.

Paper Nr: 30
Title:

BRAID - A Hybrid Processing Architecture for Big Data

Authors:

Corinna Giebler, Christoph Stach, Holger Schwarz and Bernhard Mitschang

Abstract: The Internet of Things is applied in many domains and collects vast amounts of data. This data provides access to a lot of knowledge when analyzed comprehensively. However, advanced analysis techniques such as predictive or prescriptive analytics require access to both history data, i.e., long-term persisted data, and real-time data, as well as a joint view on both types of data. State-of-the-art hybrid processing architectures for big data—namely, the Lambda and the Kappa Architecture—support the processing of history data and real-time data. However, they lack a tight coupling of the two processing modes. That is, the user has to do a lot of work manually in order to enable a comprehensive analysis of the data. For instance, the user has to combine the results of both processing modes or apply knowledge from one processing mode to the other. Therefore, we introduce a novel hybrid processing architecture for big data, called BRAID. BRAID intertwines the processing of history data and real-time data by adding communication channels between the batch engine and the stream engine. This enables comprehensive analyses to be carried out automatically at a reasonable overhead.

Paper Nr: 42
Title:

Exploring Urban Mobility from Taxi Trajectories: A Case Study of Nanjing, China

Authors:

Yihong Yuan and Maël Le Noc

Abstract: Identifying urban mobility patterns is a crucial research topic in geographic information science, transportation planning, and behavior modeling. Understanding the dynamics of daily mobility patterns is essential for the management and planning of urban facilities and services. Previous studies have utilized taxi trajectories collected from the Global Positioning System (GPS) to model various types of urban patterns, such as identifying urban functional regions and hot spots. However, there is limited research on how the results of these studies can be used to inform real-world problems in urban planning. This research examines the development of sub-centers in Nanjing, China based on taxi GPS trajectories. The results indicate a clear separation between the urban center and the sub-centers. In addition, we clustered the time series of taxi pick-up locations to model dynamic urban movement and identify outlier patterns. The results demonstrate the importance of considering human mobility patterns in identifying urban functional regions, which provides valuable input for urban planners and policy makers.
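
The clustering step can be sketched as follows. This toy k-means over invented hourly pick-up profiles is only a stand-in for the study's analysis of real Nanjing taxi GPS series; the profiles, k, and the distance choice are assumptions.

```python
# Group hourly pick-up count profiles with a plain k-means (k=2).
import random

def kmeans(series, k=2, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(series, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in series:
            # assign each profile to its nearest centroid (squared distance)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(s, centroids[j])))
            clusters[i].append(s)
        # recompute centroids as cluster means; keep old centroid if empty
        centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# two "downtown-like" and two "suburb-like" hourly profiles (invented)
profiles = [(5, 40, 50), (6, 42, 48), (30, 8, 6), (28, 9, 5)]
centroids, clusters = kmeans(profiles)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

On well-separated profiles like these, the two activity regimes are recovered regardless of the initial centroid draw.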

Area 2 - Business Analytics

Full Papers
Paper Nr: 28
Title:

Combining Prediction Methods for Hardware Asset Management

Authors:

Alexander Wurl, Andreas Falkner, Alois Haselböck, Alexandra Mazak and Simon Sperl

Abstract: As wrong estimations in hardware asset management may cause serious cost issues for industrial systems, a precise and efficient method for asset prediction is required. We present two complementary methods for forecasting the number of assets needed for systems with long lifetimes: (i) iteratively learning a well-fitted statistical model from installed systems to predict assets for planned systems, and - using this regression model - (ii) providing a stochastic model to estimate the number of asset replacements needed in the next years for existing and planned systems. Both methods were validated by experiments in the domain of rail automation.

Paper Nr: 46
Title:

An Approach for Adaptive Parameter Setting in Manufacturing Processes

Authors:

Sonja Strasser, Shailesh Tripathi and Richard Kerschbaumer

Abstract: In traditional manufacturing processes the selection of appropriate process parameters can be a difficult task which relies on rule-based schemes, expertise and domain knowledge of highly skilled workers. Usually the parameter settings remain the same for one production lot, if an acceptable quality is reached. However, each part processed has its own history and slightly different properties. Individual parameter settings for each part can further increase the quality and reduce scrap. Machine learning methods offer the opportunity to generate models based on experimental data, which predict optimal parameters depending on the state of the produced part and its manufacturing conditions. In this paper, we present an approach for selecting variables, building and evaluating models for adaptive parameter settings in manufacturing processes and the application to a real-world use case.

Short Papers
Paper Nr: 35
Title:

Variable Importance Analysis in Default Prediction using Machine Learning Techniques

Authors:

Başak Gültekin and Betül Erdoğdu Şakar

Abstract: In this study, different data mining techniques were applied to a real credit data set from a public bank to provide automated and objective credit scoring. A two-step methodology was used: determining the variables to be included in the model, and deciding on the model to classify a potential credit application as “bad credit (default)” or “good credit (not default)”. The phrases “bad credit” and “good credit” are used as class labels since this is the banking jargon in Turkey. For this two-step procedure, variable selection algorithms such as Random Forest and Boruta, and machine learning algorithms such as Logistic Regression, Random Forest and Artificial Neural Networks, were tried. At the end of the feature selection phase, the CRA_Score and III_Score variables were determined to be the most important; occupation and the number of bank products were also predictive variables. For the classification phase, the Neural Network model was the best, with higher accuracy and a lower average squared error, and the Random Forest model performed better than the Logistic Regression model.

Paper Nr: 41
Title:

Automatic Document Summarization based on Statistical Information

Authors:

Aigerim Mussina, Sanzhar Aubakirov and Paulo Trigo

Abstract: This paper presents a comparative perspective on automatic text summarization algorithms. The main contribution is the implementation of well-known algorithms and the comparison of different summarization techniques on corpora of news articles parsed from the web. The work compares three summarization techniques based on the TextRank algorithm, namely General TextRank, BM25, and LongestCommonSubstring. For the experiments, we used corpora of news articles written in Russian and Kazakh. We implemented and experimented with well-known algorithms, but evaluated them differently from previous work in summary evaluation: we propose a summary evaluation method based on keywords extracted from the corpora. We describe the application of statistical information, show the results of the summarization processes and provide their comparison.
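
A hedged sketch of the keyword-based evaluation idea: score a candidate summary by how many corpus-level keywords it preserves. The keyword list and texts below are invented; the actual work extracts keywords statistically from Russian and Kazakh news corpora.

```python
# Score a summary by the fraction of corpus keywords it retains.
def keyword_coverage(summary, keywords):
    """Fraction of corpus keywords that survive in the summary."""
    words = set(summary.lower().split())
    hits = sum(1 for k in keywords if k in words)
    return hits / len(keywords)

keywords = ["election", "parliament", "vote", "minister"]
summary = "The parliament scheduled the vote after the minister resigned"
print(keyword_coverage(summary, keywords))  # 0.75
```

A higher coverage score would then favor summarizers that keep the statistically salient terms of the corpus.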

Posters
Paper Nr: 19
Title:

No Target Function Classifier - Fast Unsupervised Text Categorization using Semantic Spaces

Authors:

Tobias Eljasik-Swoboda, Michael Kaufmann and Matthias Hemmje

Abstract: We describe a Text Categorization (TC) classifier that does not require a target function. When performing TC, there is a set of predefined, labeled categories that the documents need to be assigned to. Automated TC can be done either by describing fixed classification rules or by applying machine learning. Machine learning based TC usually occurs in a supervised learning fashion. The learner generally uses example document-to-category assignments (the target function) for training. When TC is introduced for any application, or when new topics emerge, such examples are not easy to obtain because they are time-intensive to create and can require domain experts. Unsupervised document classification eliminates the need for such training examples. We describe a method capable of performing unsupervised machine learning-based TC. Our method provides quick, tangible classification results that allow for interactive user feedback and result validation. After uploading a document, the user can agree with or correct the category assignment. This allows our system to incrementally create a target function that a regular supervised learning classifier can use to produce better results than the initial unsupervised system. To do so, the classifications need to be performed in a time acceptable for the user uploading documents. We based our method on word embedding semantics with three different implementation approaches; each evaluated using the reuters21578 benchmark (Lewis, 2004), the MAUI citeulike180 benchmark (Medelyan et al., 2009), and a self-compiled corpus of 925 scientific documents taken from the Cornell University Library arXiv.org digital library (Cornell University Library, 2016). Our method has the following advantages: Compared to keyword extraction techniques, our system can assign documents to categories that are labeled with words that do not literally occur in the document. Compared to usual supervised learning classifiers, no target function is required; without a target function, the system cannot overfit. Compared to document clustering algorithms, our method assigns documents to predefined categories and does not create unlabeled groupings of documents. In our experiments, the system achieves up to 66.73% precision, 41.8% recall and 41.09% F1 (all on reuters21578) using macroaveraging. Using microaveraging, similar effectiveness is obtained. Even though these results are below those of contemporary supervised classifiers, the system can be adopted in situations where no training data is available or where text needs to be assigned to new categories capturing freshly emerging knowledge. It requires no manually collected resources and works fast enough to gather feedback interactively, thereby creating a target function for a regular classifier.
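
The embedding-based assignment idea can be illustrated as follows. The tiny hand-made 3-d "embeddings", the category labels, and the centroid-plus-cosine scheme are invented stand-ins for a real word-embedding model, not the authors' implementation.

```python
# Assign a document to the category whose label vector is closest
# to the document's word-vector centroid.
import math

EMB = {
    "goal":    (0.9, 0.1, 0.0), "match": (0.8, 0.2, 0.0),
    "striker": (0.9, 0.0, 0.1), "sports": (1.0, 0.1, 0.0),
    "stock":   (0.0, 0.9, 0.1), "profit": (0.1, 0.9, 0.0),
    "finance": (0.0, 1.0, 0.1),
}

def centroid(words):
    vecs = [EMB[w] for w in words if w in EMB]
    return tuple(sum(d) / len(vecs) for d in zip(*vecs))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(doc_words, categories):
    doc = centroid(doc_words)
    return max(categories, key=lambda c: cosine(doc, EMB[c]))

print(classify(["striker", "goal", "match"], ["sports", "finance"]))  # sports
```

Note how the document is assigned to "sports" even though the label word itself never occurs in the document, which is the advantage claimed over keyword extraction.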

Paper Nr: 24
Title:

Towards Enabling Emerging Named Entity Recognition as a Clinical Information and Argumentation Support

Authors:

Christian Nawroth, Felix Engel, Tobias Eljasik-Swoboda and Matthias L. Hemmje

Abstract: In this paper we discuss the challenges that growing amounts of clinical literature pose for medical staff. We introduce our concepts of emerging Named Entities (eNEs) and emerging Named Entity Recognition (eNER) and show the results of an empirical study on the incidence of eNEs in the PubMed document set, which is the main contribution of this article. We discuss how eNEs can be used for Argumentation Support, Information Retrieval (IR) Support and Trend Analysis in Clinical Virtual Research Environments (VREs) dealing with large amounts of medical literature. Based on the empirical study and this discussion, we derive use cases and a data-science- and user-feedback-based architecture for the detection and use of eNEs for IR and Argumentation Support in clinical VREs, such as the related project RecomRatio.

Paper Nr: 40
Title:

Towards a Data Warehouse Architecture for Managing Big Data Evolution

Authors:

Darja Solodovnikova and Laila Niedrite

Abstract: The problem of designing data warehouses in accordance with user requirements, and adapting their data and schemata to changes in these requirements as well as in data sources, has been studied by many researchers worldwide in the context of relational database environments. However, due to the emergence of big data technologies and the necessity to perform OLAP analysis over big data, innovative methods must also be developed to support the evolution of a data warehouse used to analyse big data. Therefore, the main objective of this paper is to propose a data warehousing architecture over big data capable of automatically or semi-automatically adapting to user needs and requirements as well as to changes in the underlying data sources.

Paper Nr: 44
Title:

Development of an Online-System for Assessing the Progress of Knowledge Acquisition in Psychology Students

Authors:

Thomas Ostermann, Jan Ehlers, Michaela Warzecha, Gregor Hohenberg and Michaela Zupanic

Abstract: Results of summative examinations most often represent only a snapshot of students' knowledge of part of the curriculum and do not provide valid information on whether long-term retention of knowledge and knowledge growth take place during the course of studies. Progress testing allows the repeated formative assessment of students’ functional knowledge and consists of questions covering all domains of relevant knowledge from a given curriculum. This article describes the development and structure of an online platform for progress testing in psychology at the Witten/Herdecke University. The Progress Test Psychology (PTP) was developed in 2015 in the Department of Psychology and Psychotherapy at Witten/Herdecke University and consists of 100 confidence-weighted true/false items (sure / unsure / don’t know). The online system implementing the PTP was developed based on XAMPP, including an Apache server, a MySQL database, PHP and JavaScript. First results of a longitudinal survey show that the increase in students' knowledge over the course of studies also reliably reflects the course of the curriculum. Thus, the content validity of the PTP could be confirmed. Apart from directly measuring the long-term retention of knowledge, the use of the PTP in the admission of students applying for a Master’s program is discussed.

Paper Nr: 52
Title:

Fuzzy Metadata Strategies for Enhanced Data Integration

Authors:

Hiba Khalid, Esteban Zimanyi and Robert Wrembel

Abstract: The problem of data integration is one of the most debated issues in the general field of data management. Data integration is typically accompanied by a concept of conflict management. The problem's root lies in the heterogeneity of data sources and the probability of how each data source corresponds to another. Metadata is another important yet highly overlooked concept in these research areas. In this paper we propose the idea of leveraging metadata as a binding source in the integration process. The technique relies on exploiting textual metadata from different sources, using fuzzy logic as a coherence measure. A framework methodology has been devised for understanding the power of textual metadata. The framework operates on multiple data sources; a data source set can contain ‘n’ datasets. When two data sources are considered, they can be titled primary and secondary. The primary source is the accepting data source and thus contains richer metadata. The secondary sources are the sources requesting integration and are also guided by textual data summaries, keywords, analysis reports, etc. The Fuzzy MD framework finds similarities between primary and secondary metadata sources using fuzzy matching and string exploration. The model then provides the probable answer for each set's association with the primary accepting source. The framework relies on the origin of words and relative associativity rather than the common approach of manual metadata enrichment. This not only resolves the argument over manual metadata enrichment, it also provides a solution for generating metadata from scratch as part of the integration and analysis process.
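
The matching step can be sketched with a standard string-similarity measure. The abstract does not specify the framework's fuzzy membership functions, so difflib's ratio stands in here, and the column names are invented.

```python
# Score how well a secondary source's metadata terms match the
# primary (accepting) source's terms, as a crude integration signal.
from difflib import SequenceMatcher

def metadata_similarity(primary_terms, secondary_terms):
    """Mean best-match similarity of each primary term against the secondary set."""
    scores = []
    for p in primary_terms:
        best = max(SequenceMatcher(None, p, s).ratio() for s in secondary_terms)
        scores.append(best)
    return sum(scores) / len(scores)

primary = ["customer_id", "purchase_date", "total_amount"]
secondary = ["cust_id", "date_of_purchase", "amount_total"]
print(round(metadata_similarity(primary, secondary), 2))
```

A secondary source scoring above some threshold would then be the "probable answer" for association with the primary source.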

Area 3 - Soft Computing in Data Science

Full Papers
Paper Nr: 3
Title:

Synthetic Optimisation Techniques for Epidemic Disease Prediction Modelling

Authors:

Terence Fusco, Yaxin Bi, Haiying Wang and Fiona Browne

Abstract: In this paper, research is presented for improving optimisation performance using sparse training data for disease vector classification. Currently available optimisation techniques, such as Bayesian, evolutionary and global optimisation, can provide highly efficient and accurate results; however, their performance potential is often restricted when dealing with limited training resources. In this study, a novel approach is proposed to address this issue by combining Sequential Model-based Algorithm Configuration (SMAC) optimisation with the Synthetic Minority Over-sampling Technique (SMOTE) for optimised synthetic prediction modelling. This approach generates additional synthetic instances from a limited training sample while concurrently seeking to improve best algorithm performance. As the results show, the proposed Synthetic Instance Model Optimisation (SIMO) technique presents a viable, unified solution for finding optimum classifier performance when faced with sparse training resources. Using the SIMO approach, noticeable improvements in accuracy and F-measure were achieved over standalone SMAC optimisation. Many results showed significant improvement when comparing collective training data with SIMO instance optimisation, including individual accuracy increases of up to 46% and a mean overall increase of 13.96% across the entire 240 configurations over standard SMAC optimisation.
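
The SMOTE step inside such a pipeline can be sketched in a few lines: synthesize a minority-class instance by interpolating between a sample and one of its nearest minority neighbours. The toy 2-d points below are invented; the study applies this to disease-vector training sets.

```python
# Generate one synthetic minority instance, SMOTE-style.
import random

def smote_one(samples, k=2, rng=random):
    base = rng.choice(samples)
    # k nearest minority neighbours of the chosen base sample
    neighbours = sorted(
        (s for s in samples if s is not base),
        key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
    )[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, nb))

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
synthetic = smote_one(minority)
print(synthetic)  # lies on a segment between two real minority points
```

In SIMO, instances like this would be added to the sparse training set while SMAC searches the classifier's configuration space.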

Paper Nr: 20
Title:

Nonlinear Feature Extraction using Multilayer Perceptron based Alternating Regression for Classification and Multiple-output Regression Problems

Authors:

Ozde Tiryaki and C. Okan Sakar

Abstract: Canonical Correlation Analysis (CCA) is a data analysis technique used to extract correlated features between two sets of variables. An important limitation of CCA is that it is a linear technique that cannot capture nonlinear relations in complex situations. To address this limitation, Kernel CCA (KCCA) has been proposed, which is capable of identifying nonlinear relations with the use of the kernel trick. However, it has been shown that KCCA tends to overfit the training set without proper regularization. Besides, KCCA is an unsupervised technique which does not utilize class labels for feature extraction. In this paper, we propose the nonlinear version of the discriminative alternating regression (D-AR) method to address these problems. While in linear D-AR two neural networks, each with a linear bottleneck hidden layer, are combined using an alternating regression approach, the modified version of linear D-AR proposed in this study has a nonlinear activation function in the hidden layers of the alternating multilayer perceptrons (MLPs). Experimental results on a classification and a multiple-output regression problem with sigmoid and hyperbolic tangent activation functions show that the features found by nonlinear D-AR from training examples achieve significantly higher test-set accuracy than those of KCCA.

Paper Nr: 48
Title:

Concept Extraction with Convolutional Neural Networks

Authors:

Andreas Waldis, Luca Mazzola and Michael Kaufmann

Abstract: For knowledge management purposes, it would be interesting to classify and tag documents automatically based on their content. Concept extraction is one way of achieving this automatically by using statistical or semantic methods. Whereas index-based keyphrase extraction can extract relevant concepts for documents, the inverse document index grows exponentially with the number of words that candidate concepts can have. To address this issue, the present work trains convolutional neural networks (CNNs) containing vertical and horizontal filters to learn how to decide whether an N-gram (i.e., a consecutive sequence of N characters or words) is a concept or not, from a training set with labeled examples. The classification training signal is derived from the Wikipedia corpus, knowing that an N-gram certainly represents a concept if a corresponding Wikipedia page title exists. The CNN input feature is the vector representation of each word, derived from a word embedding model; the output is the probability of an N-gram representing a concept. Multiple configurations for vertical and horizontal filters were analyzed and configured through a hyper-parameterization process. The results demonstrated a precision of between 60 and 80 percent on average. This precision decreased drastically as N increased. However, combined with a TF-IDF based relevance ranking, the top five N-gram concepts calculated for Wikipedia articles showed a high precision of 94%, similar to part-of-speech (POS) tagging for concept recognition combined with TF-IDF, but with a much better recall for higher N. CNN seems to prefer longer sequences of N-grams as identified concepts, and can also correctly identify sequences of words normally ignored by other methods. Furthermore, in contrast to POS filtering, the CNN method does not rely on predefined rules, and could thus provide language-independent concept extraction.
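
The surrounding pipeline (the CNN itself omitted) can be sketched as: enumerate word N-grams as candidate concepts, then rank them by a simple TF-IDF score. The corpus and document below are toy values, not Wikipedia data.

```python
# Enumerate N-gram candidates and rank them by a naive TF-IDF score.
import math
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_rank(doc_tokens, corpus, max_n=3):
    # candidate concepts: all word N-grams up to length max_n
    candidates = [g for n in range(1, max_n + 1) for g in ngrams(doc_tokens, n)]
    tf = Counter(candidates)
    def idf(g):
        df = sum(1 for d in corpus if g in " ".join(d))
        return math.log(len(corpus) / (1 + df))
    return sorted(tf, key=lambda g: tf[g] * idf(g), reverse=True)

corpus = [["deep", "learning"], ["graph", "theory"], ["deep", "graph"]]
doc = ["convolutional", "neural", "network", "neural", "network"]
print(tfidf_rank(doc, corpus)[:3])
```

In the paper, the CNN's concept probability filters these candidates before such a relevance ranking picks the top five.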

Paper Nr: 55
Title:

Identifying Electromyography Sensor Placement using Dense Neural Networks

Authors:

Paolo Cappellari, Robert Gaunt, Carl Beringer, Misagh Mansouri and Massimiliano Novelli

Abstract: Neural networks are increasingly being used in medical settings to support medical practitioners and researchers in performing their work. In the field of prosthetics for amputees, sensors can be used to monitor the activity of remaining muscle and ultimately control prosthetic limbs. In this work, we present an approach to identify the location of intramuscular electromyograph sensors percutaneously implanted in extrinsic muscles of the forearm controlling the fingers and wrist during single digit movements. A major challenge is to confirm whether each sensor is placed in the targeted muscle, as this information can be critical in developing and implementing control systems for prosthetic limbs. We propose an automated approach, based on artificial neural networks, to identify the correct placement of an individual sensor. Our approach can provide feedback on each placed sensor, so researchers can validate the source of each signal before performing their data analysis.

Paper Nr: 63
Title:

Dow Jones Trading with Deep Learning: The Unreasonable Effectiveness of Recurrent Neural Networks

Authors:

Mirco Fabbri and Gianluca Moro

Abstract: Though recurrent neural networks (RNNs) outperform traditional machine learning algorithms in the detection of long-term dependencies among training instances, such as term sequences in sentences or values in time series, surprisingly few studies so far have deployed concrete RNN solutions for stock market trading. Presumably the current difficulties of training RNNs have discouraged their wide adoption. This work presents a simple but effective solution, based on a deep RNN, whose gains in trading the Dow Jones Industrial Average (DJIA) outperform the state of the art; moreover, the gain is 50% higher than that produced by similar feed-forward deep neural networks. The trading actions are driven by predictions of the price movements of the DJIA, using simply its publicly available historical series. To improve the reliability of results with respect to the literature, we have experimented with the approach on a long consecutive period of 18 years of the historical DJIA series, from 2000 to 2017. In 8 years of trading in the test period from 2009 to 2017, the solution quintupled the initial capital; moreover, since the DJIA has an increasing trend on average, we also tested the approach on an on-average decreasing trend by simply inverting the same historical series. In this extreme case, in which hardly any investor would risk money, the approach more than doubled the initial capital.
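
A back-of-the-envelope sketch of the trading rule the abstract implies: go long for the next day when the model predicts "up", stay out otherwise, and track capital. The prices and predictions below are invented; in the paper the predictions come from a deep RNN over the DJIA historical series.

```python
# Simulate prediction-driven long/flat trading over a price series.
def simulate(prices, predictions, capital=1000.0):
    for today in range(len(prices) - 1):
        if predictions[today] == "up":  # buy today, sell tomorrow
            capital *= prices[today + 1] / prices[today]
    return capital

prices = [100, 103, 101, 105, 110]
preds = ["up", "down", "up", "up"]  # one prediction per tradable day
final = simulate(prices, preds)
print(round(final, 2))
```

With perfect predictions the rule captures every up-move and sits out every down-move, which is why prediction accuracy translates directly into capital growth.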

Short Papers
Paper Nr: 5
Title:

Framework for the Recognition of Activities of Daily Living and Their Environments in the Development of a Personal Digital Life Coach

Authors:

Ivan Miguel Pires, Nuno M. Garcia, Nuno Pombo and Francisco Flórez-Revuelta

Abstract: Given how commonly ageing people use off-the-shelf mobile and technological devices, the automatic recognition of Activities of Daily Living (ADL) and their environments with these devices has been a research topic studied in recent years. This project consists in the creation of an automatic method that recognizes a defined dataset of ADL using a large set of sensors available on these devices, such as the accelerometer, the gyroscope, the magnetometer, the microphone and the Global Positioning System (GPS) receiver. The fusion of the data acquired from the selected sensors allows the recognition of an increasing number of ADL and environments, where the ADL are mainly recognized with motion, magnetic and location sensors, and the environments are mainly recognized with acoustic sensors. During this project, several methods from the literature were researched and three types of neural networks were implemented: Multilayer Perceptron (MLP) with backpropagation, feedforward neural network (FNN) with backpropagation, and Deep Neural Networks (DNN). We verified that the networks reporting the highest results are the DNN for the recognition of ADL and standing activities, and the FNN for the recognition of environments.

Paper Nr: 38
Title:

Distributed Optimization of Classifier Committee Hyperparameters

Authors:

Sanzhar Aubakirov, Paulo Trigo and Darhan Ahmed-Zaki

Abstract: In this paper, we propose an optimization workflow to predict classifier accuracy based on the exploration of the space composed of different data features and the configurations of the classification algorithms. The overall process is described for the text classification problem. We consider three main features that affect text classification and therefore classifier accuracy. The first concerns the words that comprise the input text; here we use the N-gram concept with different values of N. The second concerns the adoption of textual pre-processing steps such as stop-word filtering and stemming. The third concerns the classification algorithms' hyperparameters. In this paper, we take the well-known classifiers K-Nearest Neighbors (KNN) and Naive Bayes (NB), where K (from KNN) and the a-priori probabilities (from NB) are hyperparameters that influence accuracy. As a result, we explore the feature space (the correlation among textual and classifier aspects) and present an approximation model that is able to predict classifier accuracy.
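
The exploration loop can be sketched as a sweep over a grid of (N-gram size, stemming on/off, K for KNN) configurations, recording accuracy per point for a later approximation model. The evaluate() below is a stand-in returning an invented accuracy surface, not a real classifier run.

```python
# Sweep a configuration grid and record accuracy per point.
from itertools import product

def evaluate(ngram_n, stemming, k):
    # placeholder accuracy surface; a real run would train KNN/NB here
    return 0.6 + 0.05 * stemming + 0.02 * ngram_n - 0.01 * abs(k - 5)

grid = {
    "ngram_n": [1, 2, 3],
    "stemming": [0, 1],
    "k": [1, 3, 5, 7],
}
results = {
    cfg: evaluate(*cfg)
    for cfg in product(grid["ngram_n"], grid["stemming"], grid["k"])
}
best = max(results, key=results.get)
print(best, round(results[best], 3))  # (3, 1, 5) scores highest here
```

An approximation model fitted to `results` could then predict accuracy for unexplored configurations instead of running them all.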

Posters
Paper Nr: 4
Title:

An Approach to Use Deep Learning to Automatically Recognize Team Tactics in Team Ball Games

Authors:

Friedemann Schwenkreis

Abstract: Deep learning methods are used successfully in pattern recognition areas like face or voice recognition. However, the recognition of sequences of images for automatically recognizing tactical movements in team sports is still an unsolved area. This paper introduces an approach to solving this class of problems by mapping the sequence problem onto the classical shape recognition problem for pictures. Using team handball as an example, the paper first introduces the underlying data collection approach and a corresponding data model before introducing the actual mapping onto classical deep learning approaches. Team handball is used only as an example sport to illustrate the concept, which can be applied to any team ball game in which coordinated team moves are used.

Paper Nr: 57
Title:

Benchmarking Auto-WEKA on a Commodity Machine

Authors:

João Freitas, Nuno Lavado and Jorge Bernardino

Abstract: Machine learning model building is an important and complex task in data science, but also a good target for automation, as recently exploited by AutoML. In general, free and open-source packages offer a joint space of learning algorithms and their respective hyperparameter settings, plus an optimization method for model search and tuning. In this paper, Auto-WEKA's performance has been tested by running it for short periods of time (5, 15 and 30 minutes) on a commodity machine with suitable datasets having a limited number of observations and features. Benchmarking was performed against the best human-generated solution available in OpenML for each selected dataset. We concluded that increasing the overall time budget beyond these values did not significantly improve the classifiers' performance.

Paper Nr: 67
Title:

Predicting Flare Probability in Rheumatoid Arthritis using Machine Learning Methods

Authors:

Asmir Vodenčarević, Marlies C. van der Goes, O’Jay A. G. Medina, Mark C. H. de Groot, Saskia Haitjema, Wouter W. van Solinge, Imo E. Hoefer, Linda M. Peelen, Jacob M. van Laar, Marcus Zimmermann-Rittereiser, Bob C. Hamans and Paco M. J. Welsing

Abstract: Rheumatoid Arthritis (RA) is a chronic inflammatory disease that mostly affects the joints. It requires life-long treatment aimed at suppressing disease activity. RA is characterized by periods of low or even absent disease activity (“remission”) alternating with exacerbations of the disease (“flares”) that lead to pain, functional limitations and decreased quality of life. Flares and periods of high disease activity can lead to joint damage and permanent disability. Over the last decades, the treatment of RA patients has improved, especially with the new “biological” drugs. This expensive medication, however, also carries a risk of serious adverse events such as severe infections. Therefore, patients and physicians often wish to taper the dose or even stop the drug once stable remission is reached. Unfortunately, drug tapering is associated with an increased risk of flares. In this paper we applied machine learning methods to the Utrecht Patient Oriented Database (UPOD) to predict flare probability within a time horizon of three months. Providing information about flare probability under different dose reduction scenarios would enable clinicians to perform informed tapering, which may prevent flares, reduce adverse events and save drug costs. Our best models can predict flares with AUC values of about 80%.

Area 4 - Data Management and Quality

Full Papers
Paper Nr: 16
Title:

Seamless Database Evolution for Cloud Applications

Authors:

Aniket Mohapatra, Kai Herrmann, Hannes Voigt, Simon Lüders, Tsvetan Tsokov and Wolfgang Lehner

Abstract: We present DECA (Database Evolution in Cloud Applications), a framework that facilitates fast, robust, and agile database evolution in cloud applications. A successful business must react instantly to the changing wishes and requirements of its customers. In today’s realities, this often boils down to adding new features or fixing bugs in a software system. We focus on cloud application platforms that allow seamless development and deployment of applications by operating both the old and the new version in parallel for the duration of development and deployment. This does not work if a common database is involved, since it cannot co-exist in multiple versions. To overcome this limitation, we apply recent advances in the field of agile database evolution. DECA equips developers with an intuitive Database Evolution Language to create a new co-existing schema version for development and testing. Meanwhile, users can continue to use the old version. With the click of a button, we migrate the database to the new version and move all the users, without unpredictable downtime and without the risk of corrupting the data. Thus, DECA speeds up the evolution of information systems to the pace of modern business.

Paper Nr: 23
Title:

Knowledge at First Glance: A Model for a Data Visualization Recommender System Suited for Non-expert Users

Authors:

Petra Kubernátová, Magda Friedjungová and Max van Duijn

Abstract: In today’s age, huge amounts of data are being generated every second of every day. Through data visualization, humans can explore, analyse and present them. Choosing a suitable visualization for data is a difficult task, especially for non-experts. Data visualization recommender systems exist to aid this choice, yet they suffer from issues such as low accessibility and indecisiveness. The aim of this study is to create a model for a data visualization recommender system for non-experts that resolves these issues. Based on existing work and a survey among data scientists, requirements for a new model were identified and implemented. The result is a question-based model that uses a decision tree and a data visualization classification hierarchy to recommend a visualization. Furthermore, it incorporates both task-driven and data characteristics-driven perspectives, whereas existing solutions seem to either convolute these or focus on one of the two exclusively. Testing against existing solutions shows that the new model reaches similar results while being simpler, clearer, more versatile, extendable and transparent. The presented model can be applied in the development of new data visualization software or as part of a learning tool.
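The question-based idea can be made concrete with a minimal sketch: a decision tree whose inner nodes are yes/no questions to a non-expert user and whose leaves are chart recommendations. The questions and chart names below are illustrative inventions, not the model from the paper:

```python
# Toy question-based recommender: inner nodes ask a question, leaves
# recommend a visualization. Questions and charts are illustrative only.

TREE = {
    "question": "Do you want to show change over time?",
    "yes": {"recommendation": "line chart"},
    "no": {
        "question": "Do you want to compare parts of a whole?",
        "yes": {"recommendation": "stacked bar chart"},
        "no": {"recommendation": "scatter plot"},
    },
}

def recommend(node, answers):
    """Walk the tree with a list of yes/no answers until a leaf is reached."""
    for a in answers:
        if "recommendation" in node:
            break                       # already at a leaf
        node = node["yes" if a else "no"]
    return node.get("recommendation")

chart = recommend(TREE, [False, True])  # not time-based, but part-of-whole
```

In a full system the leaves would point into a visualization classification hierarchy rather than a single chart name, allowing data-characteristics checks to refine the task-driven answer.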

Paper Nr: 31
Title:

Enhancing Open Data Knowledge by Extracting Tabular Data from Text Images

Authors:

Andrei Puha, Octavian Rinciog and Vlad Posea

Abstract: Open data published by public institutions are one of the most important resources available online. Using this public information, decision makers can improve the lives of citizens. Unfortunately, most of the time these open data are published as files, some of which are not easily processable, such as scanned PDF files. In this paper we present an algorithm that enhances current open data knowledge by efficiently extracting tabular data from scanned PDF documents. The proposed workflow consists of several distinct steps: first, the PDF documents are converted into images; subsequently, the images are preprocessed using specific processing techniques. The final steps run an adaptive binarization of the images, recognize the structure of the tables, apply Optical Character Recognition (OCR) on each cell of the detected tables and export them as CSV. After testing the proposed method on several low-quality scanned PDF documents, it turned out that our methodology performs on par with dedicated paid OCR software, and we have integrated the algorithm as a service in our platform that converts open data into Linked Open Data.
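One step of this pipeline, adaptive binarization, can be sketched with the standard library alone. The window size and offset below are illustrative choices, and the paper's actual preprocessing may differ; the point is that each pixel is thresholded against its local neighbourhood rather than a single global value, which is what makes the method robust on unevenly lit scans:

```python
# Mean-based adaptive binarization sketch: each pixel is compared to the
# mean of its local window. Window size and offset are illustrative.

def adaptive_binarize(img, window=3, offset=10):
    """Return a 0/1 image: 1 = background, 0 = ink (darker than local mean)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    r = window // 2
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            mean = sum(vals) / len(vals)
            out[y][x] = 1 if img[y][x] > mean - offset else 0
    return out

page = [[200, 200, 200],
        [200,  40, 200],   # one dark "ink" pixel in the middle
        [200, 200, 200]]
binary = adaptive_binarize(page)
```

After binarization, table structure can be recovered from long runs of ink pixels (ruling lines) before OCR is applied cell by cell.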

Short Papers
Paper Nr: 22
Title:

An Efficient Decentralized Multidimensional Data Index: A Proposal

Authors:

Francesco Gargiulo, Antonio Picariello and Vincenzo Moscato

Abstract: The main objective of this work is to propose a decentralized data structure for storing a large amount of data under the assumption that it is not possible or convenient to use a single workstation to host all of it. The index is distributed over a computer network, and the performance of the search, insert and delete operations is close to that of traditional indices hosted on a single workstation. It is based on k-d trees and is distributed across a network of "peers", each of which hosts a part of the tree and uses message passing to communicate with the other peers. In particular, we propose a novel version of the k-nearest-neighbour algorithm that starts the query in a randomly chosen peer and terminates it as soon as possible. Preliminary experiments have shown that in about 65% of cases the query starts in a random peer without involving the peer containing the root of the tree, and in 98% of cases it terminates in a peer that does not contain the root of the tree.
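The underlying structure can be shown in its centralized form. In the proposed index, each subtree would live on a different peer and the recursion below would become message passing between peers; that distribution layer is omitted here, so this is only a sketch of the k-d tree and its nearest-neighbour pruning:

```python
import math

# Centralized k-d tree with nearest-neighbour search. In the distributed
# index each subtree is hosted by a peer and recursion becomes messaging.

def build(points, depth=0):
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if abs(diff) < math.dist(best, query):   # the other side could still win
        best = nearest(far, query, best)
    return best

tree = build([(1, 1), (4, 4), (7, 2), (3, 8)])
```

Starting the query at a random peer corresponds to entering this recursion at an interior node and only escalating towards the root when the pruning test demands it.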

Paper Nr: 56
Title:

Beyond Data Quality: Data Excellence Challenges from an Enterprise, Research and City Perspective

Authors:

Johannes Sautter, Rebecca Litauer, Rudolf Fischer, Tina Klages, Andrea Wuchner, Elena Müller, Gretel Schaj, Ekaterina Dobrokhotova, Patrick Drews and Stefan Riess

Abstract: Researchers and practitioners widely agree that data quality is one of the major goals of data management. However, data management departments in enterprises and organisations increasingly recognise needs for data availability, compliance, domain-specific operational excellence and other data challenges. In case studies from the enterprise, research and city domains, challenges regarding data availability, operational integration, compliance and the quality of data management processes are analysed. Building on the concept of data quality, this paper argues for a similar concept with a broader scope for assessing an organisation’s data suitability. Based on the literature and the case studies, this paper proposes a definition of the term data excellence as the capability of an organisation to reach its operational goals by ensuring the availability and integration of suitable, transparent and compliant high-quality data.

Posters
Paper Nr: 53
Title:

Sensor-based Database with SensLog: A Case Study of SQL to NoSQL Migration

Authors:

Prasoon Dadhich, Andrey Sadovykh, Alessandra Bagnato, Michal Kepka, Ondřej Kaas and Karel Charvát

Abstract: Sensors have gained a significant role in Internet of Things (IoT) applications across various industry sectors. The information retrieved from the sensors is generally stored in a database for post-processing and analysis. This sensor database can grow rapidly when data is frequently collected by several sensors at once, so databases often need to scale as the volume of data increases dramatically. Cloud computing and new database technologies have become key technologies for solving these problems. Traditionally, relational SQL databases are widely used and have proved reliable over time. However, the scalability of SQL databases at large scale has always been an issue. With ever-growing data volumes, various new database technologies have appeared that promise performance and scalability gains under severe conditions. They are often named NoSQL databases, as opposed to SQL databases. One of the challenges that has arisen is knowing how and when to migrate existing relational databases to NoSQL databases for performance and scalability. In this paper, we present work in progress within the DataBio project on the SensLog application case study, with some initial success. We report on the ideas behind and the approach to the migration of the SensLog platform, as well as on performance benchmarking.

Paper Nr: 64
Title:

Missed Calls Encoding Technology for GPS Data Asset Circulation

Authors:

Chengcheng Liu, Miaoqiong Wang, Pengwei Ma, Kai Wei, Chunyu Jiang and Shu Yan

Abstract: With the rapid development of data asset management in big data, encrypted data circulation is becoming more and more important for guaranteeing the value and compliance of data assets, and applications for encrypted data transmission are multiplying. People’s trajectory data is not only private but also a focus of supervision, and can even affect people’s safety; transmitting information in such settings is therefore a problem. This paper proposes a four-layer framework based on a missed-calls encoding technology that uses phone signal base stations to transmit Global Positioning System (GPS) information without user perception. To verify the effectiveness of the method, this paper applies the location information transmission framework and uses clever encodings to compress the information, reducing the number of calls and the time delays.

Area 5 - Databases and Data Security

Full Papers
Paper Nr: 7
Title:

Indexing Patterns in Graph Databases

Authors:

Jaroslav Pokorný, Michal Valenta and Martin Troup

Abstract: Nowadays graphs have become very popular in domains like social media analytics, healthcare, natural sciences, BI, networking, graph-based bibliographic IR, etc. Graph databases (GDB) allow simple and rapid retrieval of complex graph structures that are difficult to model in a traditional IS based on a relational DBMS. GDB are designed to exploit relationships in data, which means they can uncover patterns difficult to detect using traditional methods. We introduce a new method for indexing graph patterns within a GDB modelled as a labelled property graph. The index is organized in a tree structure and stored in the same database as the database graph itself. The method is analysed and implemented for the Neo4j GDB engine. It makes it possible to create, use and update indexes that speed up the process of matching graph patterns. The paper provides a comparison between queries with and without indexes.

Paper Nr: 25
Title:

Architectural Considerations for a Data Access Marketplace

Authors:

Uwe Hohenstein, Sonja Zillner and Andreas Biesdorf

Abstract: Data and data access are increasingly becoming goods to sell. This paper suggests a marketplace for data access applications where producers can offer data (access) and algorithms, while consumers can subscribe to both and use them. In particular, fine-grained controlled data access can be sold to several people with different Service Level Agreements (SLAs) and prices. A general architecture is proposed which is based upon the API management tool WSO2 to ease implementation and reduce effort. Indeed, API management provides many features that are useful in this context, but it also unveils some challenges. A deeper discussion explains the technical challenges and the alternatives for combining API management with user-specific filtering and control of SLAs.

Paper Nr: 29
Title:

Analysis of Data Structures Involved in RPQ Evaluation

Authors:

Frank Tetzel, Hannes Voigt, Marcus Paradies, Romans Kasperovics and Wolfgang Lehner

Abstract: A fundamental ingredient of declarative graph query languages are regular path queries (RPQs). They provide an expressive yet compact way to match long and complex paths in a data graph by utilizing regular expressions. In this paper, we systematically explore and analyze the design space of the data structures involved in automaton-based RPQ evaluation. We consider three fundamental data structures used during RPQ processing: adjacency lists for quick neighborhood exploration, a visited data structure for cycle detection, and the representation of intermediate results. We conduct an extensive experimental evaluation on realistic graph data sets and systematically investigate various alternative data structure representations and implementation variants. We show that carefully crafted data structures which exploit the access pattern of RPQs lead to reduced peak memory consumption and evaluation time.

Paper Nr: 47
Title:

Column Scan Optimization by Increasing Intra-Instruction Parallelism

Authors:

Nusrat Jahan Lisa, Annett Ungethüm, Dirk Habich, Nguyen Duy Anh Tuan, Akash Kumar and Wolfgang Lehner

Abstract: The key objective of database systems is to reliably manage data, whereby high query throughput and low query latency are core requirements. To satisfy these requirements for analytical query workloads, in-memory column store database systems are state-of-the-art. In these systems, relational tables are organized by column rather than by row, so that a full column scan is a fundamental key operation whose optimization is crucial. For this reason, we investigated the optimization of a well-known scan technique using SIMD (Single Instruction Multiple Data) vectorization as well as Field Programmable Gate Arrays (FPGAs). In this paper, we present both optimization approaches, with the goal of increasing intra-instruction execution parallelism so as to process more column values in a single instruction simultaneously. For both, we present selected results of our exhaustive evaluation. Based on this evaluation, we draw some lessons learned for our ongoing research activities.
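The intra-instruction parallelism idea can be illustrated in portable form with a SWAR ("SIMD within a register") sketch: pack eight 8-bit column codes into one 64-bit word and test all of them against a constant in a handful of word operations instead of eight separate comparisons. The zero-byte detector used here is a classic bit trick; the column data is made up, and the paper's actual implementations use real SIMD instructions and FPGAs rather than Python:

```python
# SWAR column-scan sketch: one 64-bit word holds eight 8-bit values, and a
# single round of word-wide arithmetic finds every byte equal to `needle`.

MASK64 = (1 << 64) - 1
LOW7 = 0x7F7F7F7F7F7F7F7F

def pack(values):
    """Pack eight 8-bit codes into one word, values[0] in the lowest byte."""
    w = 0
    for v in reversed(values):
        w = (w << 8) | v
    return w

def scan_eq(word, needle):
    """Return byte positions within the word whose value equals `needle`."""
    x = word ^ (needle * 0x0101010101010101)   # zero byte <=> match
    t = ((x & LOW7) + LOW7) | x | LOW7         # high bit set <=> byte nonzero
    hit = ~t & MASK64                          # 0x80 bit set per matching byte
    return [i for i in range(8) if hit >> (8 * i + 7) & 1]

column = [3, 7, 7, 1, 7, 0, 2, 7]
positions = scan_eq(pack(column), 7)
```

A 512-bit SIMD register plays the same role as the 64-bit word here, only with 64 packed codes per instruction instead of eight.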

Posters
Paper Nr: 6
Title:

Algorithms for Computing Inequality Joins

Authors:

Brahma Dathan and Stefan Trausan Matu

Abstract: Although the problem of joins has been known ever since the concept of relational databases was introduced, much of the research in the area has addressed the question of equijoins. In this paper, we look at the problem of inequality joins, which compare attributes using operators other than equality. We develop an algorithm for computing inequality joins on two relations with comparisons on two pairs of attributes and then extend the work to queries involving more than two comparisons. Our work also derives a lower bound for inequality joins on two relations and shows that the two-comparison algorithm is optimal.
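The setting can be illustrated with the simplest case, a single inequality comparison R.a < S.b: sorting one relation and binary-searching per tuple replaces the naive nested loop. This sketch is for the one-comparison case only; the paper's contribution concerns the harder two-comparison case and its lower bound:

```python
import bisect

# One-comparison inequality join sketch: all (x, y) with x in r, y in s
# and x < y, using sorting plus binary search instead of a nested loop.

def ineq_join(r, s):
    s_sorted = sorted(s)
    out = []
    for x in r:
        i = bisect.bisect_right(s_sorted, x)   # first y strictly greater than x
        out.extend((x, y) for y in s_sorted[i:])
    return out

pairs = ineq_join([5, 2], [3, 6, 1])
```

With two independent inequality predicates the matching tuples no longer form a suffix of a single sorted order, which is precisely why the two-comparison case needs a dedicated algorithm.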

Paper Nr: 39
Title:

Reliable In-Memory Data Management on Unreliable Hardware

Authors:

Dirk Habich, Till Kolditz, Juliana Hildebrandt and Wolfgang Lehner

Abstract: The key objective of database systems is to reliably manage data, whereby high query throughput and low query latency are core requirements. To satisfy these requirements, database systems constantly adapt to novel hardware features. Although it has been intensively studied and is commonly accepted that hardware error rates, in terms of bit flips, increase dramatically as the underlying chip structures shrink, most database system research has neglected this fact, leaving error (bit flip) detection and correction to the underlying hardware. Especially for memory, silent data corruption (SDC) resulting from transient bit flips that lead to faulty data is mainly detected and corrected at the DRAM and memory-controller layer. However, since future hardware is becoming less reliable and error detection and correction in hardware is becoming more expensive, this free ride will come to an end in the near future. To continue providing reliable data management, an emerging research direction is to employ specific, tailored protection techniques at the database system level. Following this direction, we are currently developing and implementing an adapted system design for state-of-the-art in-memory column stores. In this position paper, we summarize our vision and the current state of our research, and outline future work.
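One example of a software-level protection technique of the kind this vision describes is an AN code: every value is stored multiplied by a constant A, so any stored word that is not divisible by A reveals that bits flipped. The abstract does not name a specific technique, and A = 61 below is an illustrative choice:

```python
# AN-code sketch: values are stored as v * A; a word not divisible by A
# signals silent data corruption. A = 61 is illustrative (odd, so every
# single-bit flip changes the word by a power of two that is not 0 mod A).

A = 61

def encode(value):
    return value * A

def check_and_decode(word):
    if word % A != 0:
        raise ValueError("bit flip detected")
    return word // A

stored = encode(1000)            # stored in-memory representation
flipped = stored ^ (1 << 3)      # a transient single-bit flip in memory
try:
    check_and_decode(flipped)
    corrupted = False
except ValueError:
    corrupted = True
```

The appeal for column stores is that divisibility can be checked during the scan itself, piggybacking error detection on work the query does anyway.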

Paper Nr: 54
Title:

Graph Databases Comparison: AllegroGraph, ArangoDB, InfiniteGraph, Neo4J, and OrientDB

Authors:

Diogo Fernandes and Jorge Bernardino

Abstract: Graph databases are a very powerful solution for storing and searching data rich in relationships, such as the data of Facebook and Twitter. With the multiplication of data and the diversity of data types, there has been a need for new storage and analysis platforms that structure irregular data with a flexible schema while maintaining a high level of performance and ensuring data scalability, a problem that relational databases cannot handle effectively. In this paper, we analyse the most popular graph databases: AllegroGraph, ArangoDB, InfiniteGraph, Neo4J and OrientDB. We study the features most important for a complete and effective application, such as a flexible schema, the query language, sharding and scalability.

Paper Nr: 58
Title:

Traversal-aware Encryption Adjustment for Graph Databases

Authors:

Nahla Aburawi, Frans Coenen and Alexei Lisitsa

Abstract: Data processing methods that allow querying encrypted data, such as CryptDB (Popa et al., 2011a), utilize multi-layered encryption and encryption adjustment in order to provide a reasonable trade-off between data security and data processing efficiency. In this paper, we consider the querying of encrypted graph databases and propose a novel traversal-aware encryption adjustment scheme which trades efficiency for security. We show that by dynamically adjusting encryption layers as query execution progresses, we can correctly execute a query on the encrypted graph store while revealing less information to the adversary than in the case of static adjustment done prior to execution.
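The layering idea can be modelled abstractly. This toy is inspired by the onion design of CryptDB that the abstract cites, not its real cryptography: each value carries a stack of encryption "layers", and a layer is only peeled when the traversal actually needs to compare that node, rather than for the whole graph up front. The layer names are the standard onion labels used for illustration:

```python
# Toy onion-layer model: peeling happens lazily, per node, as the
# traversal reaches it. No real cryptography is involved.

LAYERS = ["RND", "DET"]          # outermost (most secure) first

class Cell:
    def __init__(self, value):
        self.value = value
        self.layer = 0           # index into LAYERS; 0 = fully wrapped

    def adjust_to(self, layer):
        """Peel layers one at a time and report each layer revealed."""
        peeled = []
        while LAYERS[self.layer] != layer:
            self.layer += 1
            peeled.append(LAYERS[self.layer])
        return peeled

node = Cell("alice")
revealed = node.adjust_to("DET")      # traversal reaches this node
again = node.adjust_to("DET")         # already adjusted: nothing revealed
```

Under static adjustment every cell would be peeled to DET before execution; traversal-aware adjustment leaves untouched parts of the graph at the outer layer, which is the information-leakage saving the paper quantifies.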