DATA 2019 Abstracts


Area 1 - Big Data

Full Papers
Paper Nr: 19
Title:

Manifold Learning to Identify Consumer Profiles in Real Consumption Data

Authors:

Diego Perez, Marta Rivera-Alba and Alberto Sanchez-Carralero

Abstract: Precise and comprehensive analysis of individual consumption is key for marketers and policy makers. Traditionally, people’s consumption profiles have been approximated by household surveys. Although insightful and complete, household surveys suffer from certain biases and inaccuracies. To compensate for some of those biases, we propose a new approach to compute and analyze consumer profiles based on millions of purchase transactions collected by a personal financial manager. Since this new kind of data source requires new analysis methods, in this paper we propose the use of manifold learning techniques to visualize the whole data set at once, demonstrating how these techniques can cluster consumers into more meaningful groups than demographics alone. These unsupervised behavior-based clusters allow us to draw more educated hypotheses that we could otherwise miss. As an example, we specifically discuss the characteristics of individuals with high housing and recreation consumption in our sample.
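
The abstract does not specify the authors' manifold learning pipeline; purely as an illustrative sketch, the snippet below embeds synthetic per-consumer spending profiles with t-SNE and clusters the embedding with scikit-learn. The data, category names and parameters are all assumptions, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic spending profiles: rows are consumers, columns are the share of
# spending per category (say housing, food, recreation, transport).
profiles = np.vstack([
    rng.dirichlet([8, 2, 1, 1], size=50),   # housing-heavy consumers
    rng.dirichlet([1, 2, 8, 1], size=50),   # recreation-heavy consumers
])

# Embed the profiles in 2-D with t-SNE, as one would for visualizing
# the whole data set at once.
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(profiles)

# Cluster in the embedded space to obtain behavior-based consumer groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(embedding.shape, sorted(set(labels.tolist())))
```

Clustering in the low-dimensional embedding, rather than on raw demographics, is what lets behaviorally similar consumers group together.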

Paper Nr: 20
Title:

Predicting Depression Tendency based on Image, Text and Behavior Data from Instagram

Authors:

Yu C. Huang, Chieh-Feng Chiang and Arbee P. Chen

Abstract: Depression is a common but serious mental disorder. It is classified as a mood disorder, which means that it is characterized by negative thoughts and emotions. With the development of Internet technology, more and more people post their life stories and express their emotions on social media. Social media can therefore provide a way to characterize and predict depression, and it has been widely utilized by researchers to study mental health issues. However, most existing studies focus on textual data from social media; few consider both text and image data. In this study, we aim to predict a person’s depression tendency by analyzing the images, text and behavior of his/her postings on Instagram. An effective mechanism is first employed to collect depressive and non-depressive user accounts. Next, three sets of features are extracted from image, text and behavior data to build the predictive deep learning model. We examine the potential of leveraging social media postings for understanding depression. Our experimental results demonstrate that the proposed model recognizes users who have a depression tendency with an F1 score of 82.3%. We are currently developing a tool based on this study for screening and detecting depression at an early stage.

Paper Nr: 23
Title:

A Community Detection Approach for Smart-Phone Addiction Recognition

Authors:

Fabio Cozzolino, Vincenzo Moscato, Antonio Picariello and Giancarlo Sperli

Abstract: In this paper, we present a novel approach to smart-phone addiction recognition that leverages community detection algorithms from Social Network Analysis (SNA) theory. Our basic idea is to model data concerning users’ behavior while they use mobile devices as a particular social graph, discovering by means of SNA techniques the patterns that best identify users with a high predisposition to smart-phone addiction. Finally, several experiments on a sample of users monitored for several weeks have been carried out to verify the effectiveness of the proposed approach in correctly recognizing the related addiction degree.

Paper Nr: 84
Title:

XAI: A Middleware for Scalable AI

Authors:

Abdallah Salama, Alexander Linke, Igor P. Rocha and Carsten Binnig

Abstract: A major obstacle for the adoption of deep neural networks (DNNs) is that training can take multiple hours or days even with modern GPUs. In order to speed up the training of modern DNNs, recent deep learning frameworks support distributing the training process across multiple machines in a cluster of nodes. However, even if existing well-established models such as AlexNet or GoogleNet are being used, it is still a challenging task for data scientists to scale out distributed deep learning in their environments and on their hardware resources. In this paper, we present XAI, a middleware on top of existing deep learning frameworks such as MXNet and TensorFlow to easily scale out distributed training of DNNs. The aim of XAI is that data scientists can use a simple interface to specify the model that needs to be trained and the resources available (e.g., number of machines, number of GPUs per machine, etc.). At the core of XAI, we have implemented a distributed optimizer that takes the model and the available cluster resources as input and finds a distributed setup of the training for the given model that best leverages the available resources. Our experiments show that XAI converges to a desired training accuracy 2x to 5x faster than default distribution setups in MXNet and TensorFlow.

Short Papers
Paper Nr: 26
Title:

An Evaluation of Big Data Architectures

Authors:

Valerie Garises and José G. Quenum

Abstract: In this paper, we present a novel evaluation of architectural patterns and software architecture analysis using the Architecture Tradeoff Analysis Method (ATAM). To facilitate the evaluation, we classify the intrinsic characteristics of Big Data into quality attributes. We also categorised existing architectures according to architectural patterns. Overall, our evaluation clearly shows that no single architectural pattern is enough to guarantee all the required quality attributes. As such, we recommend a combination of more than one pattern. The net effect would be to combine the benefits of the individual architectural patterns and thereby support the design of Big Data software architectures with several quality attributes.

Paper Nr: 70
Title:

Less (Data) Is More: Why Small Data Holds the Key to the Future of Artificial Intelligence

Authors:

Ciro Greco, Andrea Polonioli and Jacopo Tagliabue

Abstract: The claims that big data holds the key to enterprise successes and that Artificial Intelligence (AI) is going to replace humanity have become increasingly popular over the past few years, both in academia and in industry. However, while these claims may indeed capture some truth, they have also been massively oversold, or so we contend here. The goal of this paper is two-fold. First, we provide a qualified defence of the value of less data within the context of AI. This is done by carefully reviewing two distinct problems for big data driven AI, namely a) the limited track record of Deep Learning (DL) in key areas such as Natural Language Processing (NLP), and b) the regulatory and business significance of being able to learn from few data points. Second, we briefly sketch what we refer to as a case of “AI with humans and for humans”, namely an AI paradigm whereby the systems we build are privacy-oriented and focused on human-machine collaboration, not competition. Combining our claims above, we conclude that when seen through the lens of cognitively inspired AI, the bright future of the discipline is about less data, not more, and more humans, not less.

Paper Nr: 72
Title:

Scaling Big Data Applications in Smart City with Coresets

Authors:

Le H. Trang, Hind Bangui, Mouzhi Ge and Barbora Buhnova

Abstract: With the development of Smart Cities, various Big Data applications have been proposed within the domain. These are, however, hard to test and prototype, since such prototyping requires large computing resources. In order to save the effort of building Big Data prototypes for Smart Cities, this paper proposes an enhanced sampling technique to obtain a coreset from Big Data while keeping its features, such as the clustering structure and distribution density. In the proposed sampling method, for a given dataset and an ε>0, the method computes an ε-coreset of the dataset. The ε-coreset is then modified to obtain a sample set while ensuring separation and balance in the set. Furthermore, by considering the representativeness of each sample point, our method can help to remove noise and outliers. We believe that the coreset-based technique can be used to efficiently prototype and evaluate Big Data applications in the Smart City.
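
The paper's ε-coreset construction is not given in the abstract; as a hedged stand-in, the sketch below implements the simple "lightweight coreset" importance-sampling scheme (a uniform term mixed with the squared distance to the data mean) on synthetic data. The function name and all parameters are illustrative, not the authors' method.

```python
import numpy as np

def lightweight_coreset(X, m, seed=None):
    """Sample m weighted points; the probability mixes a uniform term with the
    squared distance to the data mean, so dense and remote regions both survive."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sq_dist = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    q = 0.5 / n + 0.5 * sq_dist / sq_dist.sum()
    idx = rng.choice(n, size=m, replace=False, p=q)
    weights = 1.0 / (m * q[idx])          # importance weights for estimates
    return X[idx], weights

rng = np.random.default_rng(1)
# Two well-separated blobs standing in for a large Smart City data set.
X = np.vstack([rng.normal(0, 1, (5000, 2)), rng.normal(8, 1, (5000, 2))])
sample, w = lightweight_coreset(X, 200, seed=1)
print(sample.shape, bool(w.min() > 0))
```

Downstream algorithms (e.g. clustering) are then run on the weighted sample instead of the full data, which is what makes prototyping cheap.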

Paper Nr: 31
Title:

Research on Parallel Incremental Association Rule Algorithm based on Data Stream

Authors:

Zheng Hua, Tao Du, Shouning Qu and Tao Mi

Abstract: Data stream association rule mining is one of the most interesting problems in the data mining community. However, traditional algorithms cannot overcome the bottleneck that the computational power of a single computer is limited and the number of candidate itemsets far surpasses the available memory space. This paper focuses on this problem and introduces a novel algorithm named Parallel Incremental Association Rule Mining based on a Hierarchical method (PIMH), which utilizes the Spark parallel computing platform as a framework to process the partitioned data in parallel. In the PIMH algorithm, the existing mining results are reused to avoid the inefficiency of repeated mining, so that association rules can be quickly updated. Additionally, a local pruning method is applied to avoid the problem that the candidate itemsets become too large to store in memory. According to the experimental results, the proposed algorithm achieves high accuracy and spends less computational time than existing data stream algorithms.
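
PIMH itself runs incrementally on Spark, which the abstract does not detail; as background only, the snippet below sketches the core subproblem of association rule mining, counting the support of candidate itemsets, on toy single-machine transactions (the data and helper names are illustrative).

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]

def frequent_itemsets(transactions, min_support, k):
    """Count the support of every k-item candidate and keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, min_support=0.6, k=2)
# A rule such as {bread} -> {milk} is then scored by
# confidence = support(bread, milk) / support(bread).
print(pairs)
```

In a streaming, partitioned setting the same counting step is what gets distributed and updated incrementally rather than recomputed from scratch.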

Paper Nr: 59
Title:

Road Operations Orchestration Enhanced with Long-short-term Memory and Machine Learning (Position Paper)

Authors:

Fuji Foo, Poh J. Peng, Robert K. Lin and Wenwey Hseush

Abstract: Road traffic management has been a priority for urban city planners seeking to mitigate urban traffic congestion. In 2018, the economic impact on the US due to the lost productivity of workers sitting in traffic, the increased cost of transporting goods through congested areas, and all of the wasted fuel amounted to US$87 billion, an average of US$1,348 per driver. In land-scarce Singapore, congestion not only translates into economic impact, but also strains the infrastructure and city land use. While techniques for traffic prediction have existed for many years, the research effort has mainly been focused on the prediction itself. The downstream impact on how a city administration should predict and react to incidents and/or events has not been widely discussed. In this paper, we propose Artificial Intelligence enabled Complex Event Processing not only to identify and predict incidents, but also to enable a swift response through the effective deployment of critical resources, ensuring well-coordinated recovery actions before any incident develops into a crisis.

Paper Nr: 71
Title:

Semi-Structured Data Model for Big Data (SS-DMBD)

Authors:

Shady Hamouda and Zurinahni Zainol

Abstract: New business applications require flexibility in the data model structure and must support the next generation of web applications and handle complex data types. The performance of processing structured data through a relational database has become incompatible with big data challenges. Nowadays, there is a need to deal with semi-structured data with a flexible schema for different applications. Not only SQL (NoSQL) databases have been presented to overcome the limitations of relational databases in terms of scale, performance, data model, and distribution system. NoSQL also supports semi-structured data, can handle a huge amount of data, and provides flexibility in the data schema. However, the data models of NoSQL systems are very complex, as there are no tools available to represent a schema for NoSQL databases. In addition, there is no standard schema for the data modelling of document-oriented databases. This study proposes a semi-structured data model for big data (SS-DMBD) that is compatible with document-oriented databases, and also proposes an algorithm for mapping the entity relationship (ER) model to SS-DMBD. A case study is used to evaluate SS-DMBD and its features. The results show that this model can address most features of semi-structured data.

Paper Nr: 73
Title:

Development of Spatial Quality Control Method for Temperature Observation Data using Cluster Analysis

Authors:

Yunha Kim, Nooree Min, Hannah Lee, Mi-Lim Ou, Sanghyeon Jeon and Myung-jin Hyun

Abstract: In the National Climate Data Center of the Korea Meteorological Administration, quality control methods for meteorological observations are applied to identify erroneous observation values. The quality control methods we have been using check the values from a single station, either instantly or temporally. Spatial checking methods, which find errors by comparing the values of several stations at a time, are difficult to apply because calculating the threshold from a large amount of observations is time consuming and various conditions must be met. In this study, we develop a new spatial checking method for temperature observation data using cluster analysis that can be performed quickly and is effective in clarifying errors that have not been discovered before.
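
The abstract does not give the clustering procedure; purely as an illustration of the idea, the sketch below groups synthetic stations into spatial clusters with scikit-learn's KMeans and flags a temperature reading that deviates strongly from its cluster's statistics. The coordinates, threshold and injected error are assumptions, not the authors' data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stations: (latitude, longitude) in two regions, plus a temperature.
coords = np.vstack([rng.normal([37.5, 127.0], 0.2, (30, 2)),
                    rng.normal([35.1, 129.0], 0.2, (30, 2))])
temps = np.concatenate([rng.normal(21.0, 0.8, 30), rng.normal(24.0, 0.8, 30)])
temps[7] = 35.0                       # injected erroneous observation

# Group stations into spatial clusters, then flag values that deviate
# strongly from their own cluster's statistics.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
flagged = []
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    z = (temps[idx] - temps[idx].mean()) / temps[idx].std()
    flagged.extend(idx[np.abs(z) > 3].tolist())
print(flagged)
```

Comparing each station only against its own cluster is what keeps regional climate differences from being mistaken for errors.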

Area 2 - Business Analytics

Full Papers
Paper Nr: 10
Title:

Using Topic Specific Features for Argument Stance Recognition

Authors:

Tobias Eljasik-Swoboda, Felix Engel and Matthias Hemmje

Abstract: Argument detection and its representation through ontologies are important parts of today’s attempts at automated recognition and processing of useful information in the vast amount of constantly produced data. However, due to the highly complex nature of an argument and its characteristics, its automated recognition is hard to implement. Given this overall challenge, as part of the objectives of the RecomRatio project, we are interested in the traceable, automated stance detection of arguments, to enable the construction of explainable pro/con argument ontologies. In our research, we design and evaluate an explainable machine learning based classifier, trained on two publicly available data sets. The evaluation results show that explainable argument stance recognition is possible with up to .96 F1 when working within the same set of topics and .6 F1 when working with entirely different topics. This informed our hypothesis that there are two sets of features in argument stance recognition: general features and topic-specific features.

Paper Nr: 33
Title:

Comparative Analysis of Store Clustering Techniques in the Retail Industry

Authors:

Kanika Agarwal, Prateek Jain and Mamta Rajnayak

Abstract: Many offline retailers in European markets are currently exploring different store designs to address local demands and to gain a competitive edge. There is significant demand in this industry to use analytics as a key pillar for taking informed, store-centric strategic decisions. The main objective of this case study is to propose a robust store clustering mechanism which will help the business understand their stores better and frame store-centric marketing strategies with the aim of maximizing revenues. This paper evaluates four advanced analytics-based clustering techniques, namely hierarchical clustering, Self-Organizing Maps, Gaussian Mixture Models, and Fuzzy C-means. These techniques are used for clustering the offline stores of a global retailer across four European markets. The results from these four techniques are compared and presented in this paper.

Paper Nr: 35
Title:

Data Mining using Morlet Wavelets for Financial Time Series

Authors:

Reginald Bolman and Thomas Boucher

Abstract: Wavelets are a family of signal processing techniques with growing popularity in the artificial intelligence community. In particular, Morlet wavelets have been applied to neural network time series trend prediction, forecasting the effects of monetary policy, etc. In this paper, we discuss the application of Morlet wavelets to discover the morphology of a time series’ cyclical components and the unsupervised data mining of financial time series in order to discover hidden motifs within the data. To perform the analysis of a given time series and compare the morphologies, this paper proposes the implementation of the “Bolman Time Series Power Comparison” algorithm, which extracts the pertinent time series motifs from the underlying dataset.
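
The "Bolman Time Series Power Comparison" algorithm is the authors' own and is not reproduced here; as a generic illustration of the underlying operation, the sketch below convolves a noisy sine with a complex Morlet wavelet and shows that the power at the matching frequency dominates the power at a non-matching one (all parameters are illustrative).

```python
import numpy as np

def morlet_power(signal, fs, freq, n_cycles=7):
    """Mean power of `signal` at `freq`, estimated by convolving it with a
    complex Morlet wavelet of roughly n_cycles cycles."""
    sigma_t = n_cycles / (2 * np.pi * freq)              # Gaussian width (s)
    t = np.arange(-3.5 * sigma_t, 3.5 * sigma_t, 1 / fs)
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t ** 2 / (2 * sigma_t ** 2))
    wavelet /= np.abs(wavelet).sum()                     # unit L1 norm
    return np.mean(np.abs(np.convolve(signal, wavelet, mode="same")) ** 2)

fs = 100.0                                               # sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
series = np.sin(2 * np.pi * 2.0 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

p_in = morlet_power(series, fs, 2.0)    # a frequency present in the series
p_out = morlet_power(series, fs, 10.0)  # a frequency absent from the series
print(p_in > p_out)
```

Scanning `freq` over a grid yields the time-frequency power surface from which cyclical morphology and motifs can be compared.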

Paper Nr: 39
Title:

Exploring Robustness in a Combined Feature Selection Approach

Authors:

Alexander Wurl, Andreas Falkner, Alois Haselböck, Alexandra Mazak and Peter Filzmoser

Abstract: A crucial task in the bidding phase of industrial systems is a precise prediction of the number of hardware components of specific types for the proposal of a future project. Linear regression models, trained on data of past projects, are efficient in supporting such decisions. The number of features used by these regression models should be as small as possible, so that determining their quantities generates minimal effort. The fact that training data are often ambiguous, incomplete, and contain outliers places challenging demands on the robustness of the feature selection methods used. We present a combined feature selection approach: (i) iteratively learn a robust, well-fitted statistical model and rule out irrelevant features, (ii) perform redundancy analysis to rule out dispensable features. In a case study from the domain of hardware management in Rail Automation, we show that this approach ensures robustness in the calculation of hardware components.

Paper Nr: 68
Title:

Farm Detection based on Deep Convolutional Neural Nets and Semi-supervised Green Texture Detection using VIS-NIR Satellite Image

Authors:

Sara Sharifzadeh, Jagati Tata and Bo Tan

Abstract: Farm detection using low-resolution satellite images is an important topic in digital agriculture. However, it has not received enough attention compared to high-resolution images. Although high-resolution images are more efficient for the detection of land cover components, the analysis of low-resolution images is still important due to the low-resolution repositories of past satellite images used for time-series analysis, their free availability, and economic concerns. The current paper addresses the problem of farm detection using low-resolution satellite images. In digital agriculture, farm detection plays a significant role in key applications such as crop yield monitoring. Two main categories of object detection strategies are studied and compared in this paper. First, a two-step semi-supervised methodology is developed using traditional manual feature extraction and modelling techniques; the developed methodology uses the Normalized Difference Moisture Index (NDMI), Grey Level Co-occurrence Matrix (GLCM), 2-D Discrete Cosine Transform (DCT) and morphological features, and a Support Vector Machine (SVM) for classifier modelling. In the second strategy, high-level features learnt from the massive filter banks of deep Convolutional Neural Networks (CNNs) are utilised. Transfer learning strategies are employed for pretrained Visual Geometry Group (VGG-16) networks. Results show the superiority of the high-level features for the classification of farm regions.

Short Papers
Paper Nr: 6
Title:

BIpm: Combining BI and Process Mining

Authors:

Mohammad Reza Harati Nik, Wil P. van der Aalst and Mohammadreza Fani Sani

Abstract: In this paper, we introduce a custom visual for Microsoft Power BI that supports process mining and business intelligence analysis simultaneously on a single platform. The tool, called BIpm, brings a simple, agile, user-friendly, and affordable solution for studying process models over multidimensional event logs. The Power BI environment provides many self-service BI and OLAP features that can be exploited through our custom visual aimed at the analysis of process data. The resulting toolset allows for accessing various input data sources and generating online reports and dashboards. Rather than designing and working with reports in the Power BI service on the web, they can also be viewed in the Power BI mobile apps, which means BIpm provides process mining visualizations on mobile devices. Therefore, BIpm can encourage many businesses and organizations to perform process mining analysis together with business intelligence analytics, enabling managers and decision makers to translate the discovered insights into better decisions and improved performance more quickly.

Paper Nr: 34
Title:

A Graded Concept of an Information Model for Evaluating Performance in Team Handball

Authors:

Friedemann Schwenkreis

Abstract: Although team handball is a very popular sport in Europe, computer science has almost completely ignored this area in the past. This article introduces a graded approach to an information model that allows expressing the effectiveness of a team as well as of single players, thus providing a basis for information-based decisions by coaches as well as for applying analytical methods. From this perspective, the article is an early step towards introducing digitalization via a data model into a very classical sports area, while introducing mechanisms that take into account the available degree of digitalization.

Paper Nr: 40
Title:

Detection of e-Commerce Anomalies using LSTM-recurrent Neural Networks

Authors:

Merih Bozbura, Hunkar C. Tunc, Miray E. Kusak and C. O. Sakar

Abstract: As e-commerce sales grow in the global retail sector year by year, detecting anomalies that occur in the most important key performance indicators (KPIs) in real time has become a critical requirement for e-commerce companies. Such anomalies, which may arise from software updates, server failures, or incorrect price entries, cause substantial revenue loss until they are detected along with their root causes. In this paper, we present a comparative analysis of various anomaly detection methods for detecting e-commerce anomalies. For this purpose, we first present a univariate analysis of six commonly used anomaly detection methods on two important KPIs of an e-commerce website. The highest F1 scores and recall values on the test sets of both KPIs are obtained using a Long Short-Term Memory (LSTM) network, showing that LSTM fits the dynamics of e-commerce KPIs better than time-series based prediction methods. Then, in addition to the univariate analysis of the methods, we feed campaign information into the LSTM network, considering that campaigns have significant effects on the values of KPIs in the e-commerce domain and that this information can help prevent false positives that may occur in campaign periods. The results also show that constructing a multivariate LSTM by feeding the campaign information as an additional input improves the adaptability of the model to sudden changes occurring in campaign periods.
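
The paper's LSTM architecture is not specified in the abstract; independently of the forecaster used, KPI anomaly detection of this kind reduces to thresholding the prediction residual. The sketch below illustrates that step with a naive seasonal forecast standing in for the LSTM (the data, threshold and injected anomaly are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
period = 24                                  # e.g. an hourly KPI with a daily cycle
t = np.arange(24 * 14)
kpi = 100 + 20 * np.sin(2 * np.pi * t / period) + rng.normal(0, 2, t.size)
kpi[200] -= 60                               # injected anomaly (e.g. price-entry error)

# Stand-in forecaster: predict each point from the value one period earlier.
pred = np.roll(kpi, period)
resid = (kpi - pred)[period:]                # drop the warm-up period

# Flag points whose residual exceeds k robust standard deviations.
mad = np.median(np.abs(resid - np.median(resid)))
z = 0.6745 * (resid - np.median(resid)) / mad
anomalies = np.where(np.abs(z) > 5)[0] + period
print(anomalies)
```

With an actual LSTM, `pred` would simply be the network's one-step-ahead forecast; the thresholding logic stays the same.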

Paper Nr: 42
Title:

Approaches to Identify Relevant Process Variables in Injection Moulding using Beta Regression and SVM

Authors:

Shailesh Tripathi, Sonja Strasser, Christian Mittermayr, Matthias Dehmer and Herbert Jodlbauer

Abstract: In this paper, we analyze data from an injection moulding process to identify key process variables which influence the quality of the production output. The available data from the injection moulding machines provide information about the run-time and setup parameters of the machines and the measurements of different process variables through sensors. Additionally, we have data about the total output produced and the number of scrap parts. In the first step of the analysis, we preprocessed the data by combining the different sets of data for a whole process. Then we extracted different features, which we used as input variables for modeling the scrap rate. For the predictive modeling, we employed three different models: beta regression with backward selection, beta boosting with regularization, and SVM regression with a radial kernel. All these models provide a set of common key features which affect the scrap rates.

Paper Nr: 46
Title:

A Data Mining Study on Pressure Ulcers

Authors:

Francisco Mota, Nuno Abreu, Tiago Guimarães and Manuel F. Santos

Abstract: Nurses follow well-defined guidelines in order to avoid the occurrence of pressure ulcers (pU) in patients under their care, but they are not always successful. This work intends to produce prediction models using Data Mining (DM) techniques in order to anticipate pU treatment. The work was conducted at the Oporto Hospital Center (CHP). For the construction of this DM study, the phases of the CRISP-DM methodology were taken into account. In particular, the DM focus is to show that the time factor and the frequency of interventions may influence the performance of pU classification models. To prove this, we used a data set (containing 1339 records) to which different classification techniques were applied using the WEKA tool. Through the decision tree classification technique, it was possible to create a guideline that contains all the scenarios and instructions that professionals can use in order to prevent patients from developing pU. For its construction, we used the model that presented the highest sensitivity (number of positive cases correctly classified as "NO", i.e., did not develop pU). The conclusions were that the factors studied are good predictors of pU and that the guideline obtained through automatic techniques can help professionals provide care to the patient more quickly.

Paper Nr: 54
Title:

Empowered by Innovation: Unravelling Determinants of Idea Implementation in Open Innovation Platforms

Authors:

Frederik Situmeang, Rob Loke, Nelleke de Boer and Danielle de Boer

Abstract: Companies use crowdsourcing to solve specific problems or to search for innovation. By using open innovation platforms, where community members propose ideas, companies can better serve customer needs. So far, it remains unclear which factors influence idea implementation in a crowdsourcing context. With the research idea that we present here, we aim to get a better understanding of the success and failure of ideas by examining relationships between characteristics of ideators, characteristics of ideas, and the likelihood of implementation. In order to test the methodological approach that we propose in this paper, in which we investigate business-relevant innovativeness as well as sentiment based on text analytics, data including unstructured text was mined from Dell IdeaStorm using web crawling and scraping techniques. Some relevant hypotheses that we define in this paper were confirmed on the Dell IdeaStorm dataset, but in order to generalize our findings we plan to apply them to the Lego dataset in our current work in progress. Possible implications of our novel research idea can be used to fill theoretical gaps in the marketing literature, help companies better structure their search for innovation, and help ideators better understand the factors contributing to successful idea generation.

Paper Nr: 63
Title:

Machine Learning Approach for National Innovation Performance Data Analysis

Authors:

Dominik Forner, Sercan Ozcan and David Bacon

Abstract: National innovation performance is essential for being economically competitive. The key determinants of its increase or decrease and the impact of governmental decisions or policy instruments are still not clear. Recent approaches are limited either by qualitatively selected features or by a small database with few observations. The aim of this paper is to propose a suitable machine learning approach for national innovation performance data analysis. We use clustering and correlation analysis, a Bayesian Neural Network with Local Interpretable Model-Agnostic Explanations, and BreakDown for decomposing innovation output predictions. Our results show that the machine learning approach is appropriate for benchmarking national innovation profiles and for identifying key determinants at the cluster as well as the national level, whilst considering correlating features and long-term effects. Furthermore, the impact of changes in innovation input (e.g. by governmental decisions or innovation policy) on innovation output, and thereby the increase or decrease of national innovation performance, can be predicted.

Paper Nr: 65
Title:

On Bayes Factors for Success Rate A/B Testing

Authors:

Maciej Skorski

Abstract: This paper discusses Bayes factors, an alternative to classical frequentist hypothesis testing, within the standard A/B proportion testing setup of observing outcomes of independent trials (which finds applications in industrial conversion testing). It is shown that the Bayes factor is controlled by the Jensen-Shannon divergence of the success ratios in the two tested groups, and the latter is bounded (under mild conditions) by Welch’s t-statistic. The result implies an optimal bound on the necessary sample size for Bayesian testing, and demonstrates the relation to its frequentist counterpart (effectively bridging Bayes factors and p-values).
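
The paper's bound itself is not reproduced here; the snippet below merely computes the two quantities the abstract relates, the Jensen-Shannon divergence between the two groups' success rates and Welch's t-statistic, for an illustrative A/B outcome.

```python
import math

def bernoulli_js(p, q):
    """Jensen-Shannon divergence (in nats) between Bernoulli(p) and Bernoulli(q)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip((a, 1 - a), (b, 1 - b)) if x > 0)
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def welch_t(p, n, q, m):
    """Welch's t-statistic for two observed proportions with sample sizes n, m."""
    var = p * (1 - p) / n + q * (1 - q) / m
    return (p - q) / math.sqrt(var)

# Illustrative A/B test: 120/1000 vs 150/1000 conversions.
p, q, n, m = 0.12, 0.15, 1000, 1000
print(bernoulli_js(p, q), welch_t(p, n, q, m))
```

When the two rates coincide the divergence is zero and grows with their separation, which is the intuition behind tying it to the t-statistic.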

Paper Nr: 78
Title:

Sentiment Analysis of German Emails: A Comparison of Two Approaches

Authors:

Bernd Markscheffel and Markus Haberzettl

Abstract: The increasing number of emails sent daily to the customer service of companies confronts them with new challenges. In particular, a lack of resources to deal with critical concerns, such as complaints, poses a threat to customer relations and the public perception of companies. Therefore, it is necessary to prioritize these concerns in order to avoid negative effects. Sentiment analysis, i.e. the automated recognition of the mood in texts, makes such prioritisation possible. The sentiment analysis of German-language emails is still an open research problem, and there is no evidence of a dominant approach in this context. Therefore, two approaches applicable in this problem context are compared. The first approach is based on the combination of sentiment lexicons and machine learning methods. The second approach extends the first by using further features in addition to the lexicons; these features are generated by feature extraction methods. The methods used in both approaches were identified through a systematic literature search. A gold-standard corpus is generated as the basis for the comparison of the two approaches. Systematic experiments are carried out in which the different method combinations for the approaches are examined. The results of the experiments show that the combination of feature extraction methods with sentiment lexicons and machine learning approaches generates the best classification results.

Paper Nr: 79
Title:

Data Analytics for Smart Manufacturing: A Case Study

Authors:

Nadeem Iftikhar, Thorkil Baattrup-Andersen, Finn E. Nordbjerg, Eugen Bobolea and Paul-Bogdan Radu

Abstract: Due to the emergence of the fourth industrial revolution, manufacturing businesses all over the world are changing dramatically; they need enhanced efficiency, competency and productivity. More and more manufacturing machines are equipped with sensors, and these sensors produce huge volumes of data. Most companies neither realize the value of this data nor know how to capitalize on it, and they lack the techniques and tools to collect, store, process and analyze it. The objective of this paper is to propose data analytic techniques to analyze manufacturing data. The proposed techniques provide both descriptive and predictive analysis. In addition, data from the company’s ERP system is integrated into the analysis. These techniques will help companies improve operational efficiency and achieve competitive benefits.

Paper Nr: 14
Title:

How to Boost Customer Relationship Management via Web Mining Benefiting from the Glass Customer’s Openness

Authors:

Frederik S. Bäumer and Bianca Buff

Abstract: Customer Relationship Management refers to the consistent orientation of a company towards its customers. Since this requires customer-specific data sets, techniques such as web mining are used to acquire information about customers and their behavior. In this case study, we show how web mining can be used to automatically collect information from clients’ websites for Customer Relationship Management systems in Business-to-Business environments. We use tailored local grammars to extract relevant information in order to build up a data set that meets the required high quality standards. The evaluation shows that local grammars largely produce high-quality results, but turn out to be too rigid in some cases. In summary, our case study demonstrates that web mining in combination with local grammars is suitable for Business-to-Business CRM as long as the information demand can be defined precisely and the requested information is available online.

Paper Nr: 25
Title:

Case Study of Anomaly Detection and Quality Control of Energy Efficiency and Hygrothermal Comfort in Buildings

Authors:

Carlos Eiras-Franco, Miguel Flores, Verónica Bolón-Canedo, Sonia Zaragoza, Rubén Fernández-Casal, Salvador Naya and Javier Tarrío-Saavedra

Abstract: The aim of this work is to propose different statistical and machine learning methodologies for identifying anomalies and controlling the quality of energy efficiency and hygrothermal comfort in buildings. Companies in the building energy sector are interested in statistical and machine learning tools that automate the control of energy consumption and ensure the quality of Heating, Ventilation and Air Conditioning (HVAC) installations. Consequently, a methodology based on the application of the Local Correlation Integral (LOCI) anomaly detection technique has been proposed. In addition, the most critical variables for anomaly detection are identified using the ReliefF method. Once vectors of critical variables are obtained, multivariate and univariate control charts can be applied to control the quality of HVAC installations (consumption, thermal comfort). In order to test the proposed methodology, the companies involved in this project have provided the case study of a store of a clothing brand located in a shopping center in Panama. It is important to note that this is a controlled case study for which all the anomalies had previously been identified by maintenance personnel. Moreover, as an alternative solution, in addition to machine learning and multivariate techniques, new nonparametric control charts for functional data based on data depth have been proposed and applied to curves of daily energy consumption in HVAC.

Paper Nr: 28
Title:

Productivity and Impact Analysis of a Research and Technology Development Center using Google Scholar Information

Authors:

Eric Garcia-Cano, Eugenio López-Ortega and Luis Alvarez-Icaza

Abstract: This paper presents a project aimed at evaluating the scientific productivity and impact of a Mexican research and technological development center. The proposed evaluation is based on an automated bibliometric analysis system that exploits Google Scholar (GS) information. The data import process for the evaluation is described, including aspects such as information request times, data verification through a parallel query in Crossref, and homogenization of publication sources. As a result, 8,492 documents by 137 researchers associated with the research center were identified. These documents have received 74,683 citations. GS includes a great variety of published materials, such as journal papers, books, conference proceedings, white papers, and technical reports. This diversity of documents allows for a broader evaluation that takes into consideration types of research products that are not usually considered when assessing scientific productivity. From our work, we conclude that the information in GS can be used to conduct a formal analysis of the productivity and impact of a research center.

Paper Nr: 29
Title:

Security Issues of Scientific based Big Data Circulation Analysis

Authors:

Anastasia Andreasyan, Artem Balyakin, Marina Nurbina and Alina Mukhamedzhanova

Abstract: The paper deals with legal issues arising from the need to regulate Big Data. For the purpose of this study, we consider aspects of the legal definition of Big Data, its classification, and an analysis of risks in the global experience of legal regulation of Big Data. The authors believe that in the context of rapidly advancing technology it is extremely important to strike a balance: to protect sensitive data without barring technological development, taking into account the socio-economic impact of Big Data technology. We stress that it is more important to control the application of Big Data analysis than the information itself used in the data sets.

Paper Nr: 30
Title:

FOCA: A System for Classification, Digitalization and Information Retrieval of Trial Balance Documents

Authors:

Gokce Aydugan Baydar and Seçil Arslan

Abstract: Credit risk evaluation and sales target optimization are core businesses for financial institutions. Financial documents like t-balances, balance sheets and income statements are the most important inputs for both of these core businesses. A t-balance is a semi-structured financial document which is constructed periodically by accountants and contains detailed accounting transactions. FOCA is an end-to-end system which first classifies financial documents in order to recognize t-balances, then digitalizes them into a tree-structured form and finally extracts valuable information such as bank names, human-company distinction, deposit type and liability term from the free-format text fields of t-balances. The extracted information is also enriched by matching human and company names that are in a relationship with existing customers of the bank against the customer database. Pattern recognition, natural language processing and information retrieval techniques are utilized for these capabilities. FOCA supports the decision and operational processes of corporate/commercial/SME sales and financial analysis departments in order to empower new customer engagement, enable cross-sell and up-sell to existing customers, and ease financial analysis operations by digitalizing t-balances.

Paper Nr: 47
Title:

A Parallel Bit-map based Framework for Classification Algorithms

Authors:

Amila De Silva and Shehan Perera

Abstract: Bitmaps are gaining popularity in data mining applications that use GPUs, since the memory organisation and design of a GPU demand regular and simple structures. However, the absence of a common framework has limited the benefits of bitmaps and GPUs mostly to Frequent Itemset Mining (FIM) algorithms. In this paper, we present a framework based on bitmap techniques that speeds up classification algorithms on GPUs. The proposed framework, which uses both the CPU and the GPU for algorithm execution, delegates compute-intensive operations to the GPU. Using the framework, we implement two classification algorithms, Naïve Bayes and Decision Trees, both of which outperform their CPU counterparts by several orders of magnitude.
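The bitmap idea this abstract builds on can be illustrated with a minimal CPU-only sketch (the toy data, names, and pure-Python setting below are assumptions for illustration, not the authors' GPU implementation): when every attribute value and class label is encoded as a bitmap over the records, the frequency counts a classifier like Naïve Bayes needs reduce to a bitwise AND followed by a population count.

```python
# Illustrative sketch: Python ints serve as bitmaps; bit i marks record i.
def make_bitmap(column, value):
    """Bitmap with bit i set iff column[i] == value."""
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return bits

def popcount(bits):
    """Number of set bits in the bitmap."""
    return bin(bits).count("1")

# Toy data set: one attribute ("outlook") and a class label ("play").
outlook = ["sunny", "rain", "sunny", "rain", "sunny"]
play    = ["no",    "yes",  "no",    "yes",  "yes"]

sunny_bm = make_bitmap(outlook, "sunny")
yes_bm   = make_bitmap(play, "yes")

# The count behind P(outlook=sunny | play=yes) is just AND + popcount:
joint = popcount(sunny_bm & yes_bm)  # records that are sunny AND "yes"
prior = popcount(yes_bm)             # records that are "yes"
print(joint, prior)  # 1 3
```

Regular bit operations like these map naturally onto GPU memory layouts, which is the appeal the abstract describes.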

Paper Nr: 50
Title:

Effective Training Methods for Automatic Musical Genre Classification

Authors:

Eren Atsiz, Erinç Albey and Enis Kayış

Abstract: Musical genres are labels created by humans based on the mutual characteristics of songs, which are also called musical features. These features are key indicators of the content of the music. Developing an automatic solution for genre classification, rather than relying on human judgment, has been a significant issue over the last decade. In order to classify songs automatically, different approaches have been proposed, studying various datasets and various parts of songs. In this paper, we suggest an alternative genre classification method based on which part of a song should be used to achieve a better accuracy level. A wide range of acoustic features is obtained at the end of the analysis, and we discuss whether it is better to use full versions or pieces of songs. Both alternatives are implemented and the results are compared. The best accuracy level is 55%, obtained when considering the full versions of songs. In addition, a separate analysis of Turkish songs is performed. All analyses, data, and results are visualized by a dynamic dashboard system created specifically for this study.

Paper Nr: 76
Title:

Data Quality in Secondary Data Analysis: A Case Study of Ecological Data using a Semiotic-based Approach

Authors:

Mila Kwiatkowska and Frank Pouw

Abstract: Data quality problems are widespread in secondary data when they are used for data warehousing and data mining. This paper advocates a broad semiotic approach to data quality. The main premises of this expanded semiotic framework are (1) data represent some reality, (2) data are created and interpreted by humans in a communication process, (3) data are used for specific purposes by humans, and (4) data cannot be created, interpreted and used without knowledge. Thus, the semiotic-based approach to data quality in secondary data analysis has four aspects: (1) representational, (2) communicational, (3) pragmatic, and (4) knowledge-based. To illustrate these four aspects, we present a case study of ecological data analysis used in the creation of an ornithological data warehouse. We discuss the temporal data (the ecological notion of time), spatial ecological data (communication processes and protocols used for data collection), and bioacoustic data processing (domain knowledge needed for the specification of data provenance).

Area 3 - Soft Computing in Data Science

Full Papers
Paper Nr: 48
Title:

International Roaming Traffic Optimization with Call Quality

Authors:

Ahmet Şahin, Kenan C. Demirel, Erinc Albey and Gonca Gürsun

Abstract: In this study we focus on a Steering International Roaming Traffic (SIRT) problem with a single service, which concerns a telecommunication operator’s agreements with other operators that enable subscribers to access services, without interruption, when they are outside the operator’s coverage area. Under these agreements, a subscriber’s call from abroad is steered to a partner operator. The decision of which partner each call is forwarded to is based on the user’s location (country/city), the price of the partner operator for that location, and the service quality of the partner operator. We develop an optimization model that considers agreement constraints and quality requirements while satisfying subscriber demand over a predetermined time interval. We test the performance of the proposed approach using different execution policies, such as running the model once and fixing the roaming decisions over the planning interval, or dynamically updating the decisions using a rolling horizon approach. We present a rigorous trade-off analysis that aims to help the decision maker assess the relative importance of cost, quality and ease of implementation. Our results show that the developed optimization model decreases steering cost by approximately 25% and avoids operator mistakes, while the quality of the steered calls is kept above the base quality level.
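The core trade-off the abstract describes — cheapest partner subject to a quality floor and agreement limits — can be sketched with a simple greedy routine (all data, names, and the greedy strategy itself are illustrative assumptions; the paper develops a proper optimization model, not this heuristic):

```python
# Hypothetical sketch: route each location's traffic to the cheapest partner
# whose quality meets a base threshold, respecting per-agreement capacity.
def steer(demand, partners, base_quality):
    """demand: {location: minutes}.
    partners: {name: {location: (price, quality, capacity)}}.
    Returns {(location, partner): minutes routed}."""
    plan = {}
    remaining = {p: {loc: cap for loc, (_, _, cap) in offers.items()}
                 for p, offers in partners.items()}
    for loc, minutes in demand.items():
        # Candidates meeting the quality floor, cheapest first.
        cands = sorted((offers[loc][0], p) for p, offers in partners.items()
                       if loc in offers and offers[loc][1] >= base_quality)
        for price, p in cands:
            if minutes == 0:
                break
            take = min(minutes, remaining[p][loc])
            if take > 0:
                plan[(loc, p)] = take
                remaining[p][loc] -= take
                minutes -= take
    return plan

demand = {"DE": 100, "FR": 50}
partners = {
    "OpA": {"DE": (0.10, 0.90, 60), "FR": (0.20, 0.80, 50)},
    "OpB": {"DE": (0.15, 0.95, 100)},
}
plan = steer(demand, partners, base_quality=0.85)
print(plan)  # DE split across OpA (60 min) and OpB (40 min); FR unroutable
```

A greedy plan like this ignores the agreement and horizon constraints that make the authors' problem an optimization model, which is exactly why their rolling-horizon policies matter.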

Short Papers
Paper Nr: 49
Title:

Ensemble Learning based on Regressor Chains: A Case on Quality Prediction

Authors:

Kenan C. Demirel, Ahmet Şahin and Erinc Albey

Abstract: In this study we construct a prediction model which utilizes production process parameters acquired from a textile machine and predicts the quality characteristics of the final yarn. Several machine learning algorithms (decision tree, multivariate adaptive regression splines and random forest) are used for prediction. An ensemble method, using the idea of regressor chains, is developed to further improve the prediction performance. The collected data is first segmented into two parts (labeled as “normal” and “unusual”) using the local outlier factor method, and the performance of the algorithms is tested for each segment separately. The ensemble idea proves its competence especially for the cases where the collected data is categorized as unusual; in such cases the ensemble algorithm improves the prediction accuracy significantly.
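The regressor-chain idea the abstract mentions can be sketched briefly: each target is predicted in sequence, and the prediction for an earlier target is appended to the feature vector used for later ones. In the sketch below, a 1-nearest-neighbour regressor stands in for the tree/MARS/forest base learners, and the toy data is invented; none of this reproduces the authors' actual model.

```python
# Illustrative regressor chain with a 1-NN base learner (assumed stand-in).
def knn1_predict(X_train, y_train, x):
    """Predict with the single nearest training point (squared Euclidean)."""
    best = min(range(len(X_train)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)))
    return y_train[best]

def chain_fit_predict(X_train, Y_train, x):
    """Predict all targets for x, feeding each prediction forward as a feature."""
    preds, feats = [], list(x)
    feats_train = [list(row) for row in X_train]
    for t in range(len(Y_train[0])):            # one target at a time
        y_t = [row[t] for row in Y_train]
        p = knn1_predict(feats_train, y_t, feats)
        preds.append(p)
        feats = feats + [p]                     # chain: prediction becomes a feature
        feats_train = [f + [y] for f, y in zip(feats_train, y_t)]
    return preds

# Toy process data: two machine parameters -> two yarn quality targets.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
Y = [[10.0, 0.5], [20.0, 0.7], [30.0, 0.9]]
print(chain_fit_predict(X, Y, [1.1, 2.1]))  # [10.0, 0.5]
```

Chaining lets correlated quality targets inform each other, which is the intuition behind using it as an ensemble device.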

Paper Nr: 32
Title:

Improving Indoor Positioning via Machine Learning

Authors:

Aigerim Mussina and Sanzhar Aubakirov

Abstract: The problem of real-time location systems is of current interest: cities are growing, and buildings are becoming larger and more complex. In this paper we describe the indoor positioning problem using the example of user tracking with Bluetooth Low Energy technology and the received signal strength indicator (RSSI). We experimented with and compared our simple hand-crafted rules against the following machine learning algorithms: Naive Bayes and Support Vector Machine. The goal was to identify the actual position of an active label among three possible statuses and achieve maximum accuracy. Finally, we achieved an accuracy of 0.95.
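A Naive Bayes classifier over RSSI readings, as used in the comparison above, can be sketched in a few lines (the Gaussian model, the beacon layout, and the three status names below are illustrative assumptions, not the paper's setup):

```python
# Minimal Gaussian Naive Bayes sketch for RSSI-based status classification.
import math

def fit(samples):
    """samples: {status: list of RSSI vectors}; returns per-dim (mean, var)."""
    model = {}
    for status, vecs in samples.items():
        stats = []
        for dim in zip(*vecs):                      # one beacon at a time
            mu = sum(dim) / len(dim)
            var = max(1e-6, sum((x - mu) ** 2 for x in dim) / len(dim))
            stats.append((mu, var))
        model[status] = stats
    return model

def log_gauss(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, rssi):
    """Pick the status with the highest (log) likelihood for the RSSI vector."""
    return max(model, key=lambda s: sum(log_gauss(x, mu, var)
                                        for x, (mu, var) in zip(rssi, model[s])))

# Toy training data: RSSI (dBm) from two beacons for three possible statuses.
train = {
    "room_a":  [[-40, -80], [-42, -78], [-39, -82]],
    "room_b":  [[-80, -41], [-78, -43], [-82, -40]],
    "hallway": [[-60, -61], [-62, -59], [-59, -60]],
}
model = fit(train)
print(predict(model, [-41, -79]))  # matches room_a's signal signature
```

Hand-crafted threshold rules would hard-code the same signal signatures that this model learns from samples.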

Paper Nr: 41
Title:

An Unsupervised Drift Detector for Online Imbalanced Evolving Streams

Authors:

D. Himaja, T. M. Padmaja and P. R. Krishna

Abstract: Detecting concept drift in an imbalanced evolving stream is a challenging task. At high imbalance ratios, the poor or nil performance estimates of the learner on the minority class lead to drift detection failures. To ameliorate this problem, we propose a new drift detection and adaptation framework. The proposed drift detection mechanism is carried out in two phases: unsupervised drift detection, followed by supervised drift detection with queried labels. The adaptation framework is based on batch-wise active learning. Comparative results on four synthetic and one real-world balanced and imbalanced evolving streams against other prominent drift detection methods indicate that our approach detects drift better, with low false positive rates.

Paper Nr: 57
Title:

Classification of Alzheimer’s Disease using Machine Learning Techniques

Authors:

Muhammad Shahbaz, Shahzad Ali, Aziz Guergachi, Aneeta Niazi and Amina Umer

Abstract: Alzheimer’s disease (AD) is a well-known and widespread neurodegenerative disease which causes cognitive impairment. It is one of the most frequently studied diseases of the nervous system in medicine and healthcare, yet it has no cure and no treatment that can slow or stop its progression. However, there are different options (drug or non-drug) that may help to treat the symptoms of AD at its different stages and improve the patient’s quality of life. As AD progresses over time, patients at its different stages need to be treated differently. For that purpose, the early detection and classification of the stages of AD can be very helpful for the treatment of the symptoms of the disease. At the same time, the use of computing resources in healthcare is continuously increasing, and it is becoming the norm to record electronically the patient data that was traditionally recorded on paper-based forms. This yields increased access to a large number of electronic health records (EHRs). Machine learning and data mining techniques can be applied to these EHRs to enhance the quality and productivity of medicine and healthcare centers. In this paper, six different machine learning and data mining algorithms, including k-nearest neighbors (k-NN), decision tree (DT), rule induction, Naive Bayes, generalized linear model (GLM) and a deep learning algorithm, are applied to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset in order to classify the five different stages of AD and to identify the most distinguishing attribute for each stage within the ADNI dataset. The results of the study reveal that the GLM can efficiently classify the stages of AD with an accuracy of 88.24% on the test dataset. The results also reveal that these techniques can be successfully used in medicine and healthcare for the early detection and diagnosis of the disease.

Paper Nr: 82
Title:

LOS/NLOS Wireless Channel Identification based on Data Mining of UWB Signals

Authors:

Gianluca Moro, Roberto Pasolini and Davide Dardari

Abstract: Localisation algorithms based on the estimation of the time-of-arrival of the received signal are particularly interesting when ultra-wide band (UWB) signaling is adopted for high-definition location-aware applications. In this context, the non-line-of-sight (NLOS) propagation condition may drastically degrade localisation accuracy if not properly recognised. We propose a new NLOS identification technique based on the analysis of UWB signals through supervised and unsupervised machine learning algorithms, which are typically adopted to extract knowledge from data according to the data mining approach. Thanks to these algorithms we can automatically generate a very reliable model that recognises whether a received UWB signal has crossed obstacles (an NLOS situation). The main advantage of this solution is that it extracts the model for NLOS identification directly from example waveforms gathered in the environment and does not rely on the empirical tuning of parameters required by other NLOS identification algorithms. Moreover, experiments show that accurate NLOS classifiers can be extracted from measured signals, either pre-classified or unclassified, and even from samples algorithmically generated from statistical models, allowing the application of the method in real scenarios without training it on real data.

Area 4 - Data Management and Quality

Full Papers
Paper Nr: 21
Title:

Distributed and Scalable Platform for Collaborative Analysis of Massive Time Series Data Sets

Authors:

Eduardo Duarte, Diogo Gomes, David Campos and Rui L. Aguiar

Abstract: The recent expansion of metrification on a daily basis has led to the production of massive quantities of data, which in many cases correspond to time series. To streamline the discovery and sharing of meaningful information within time series, a multitude of analysis software tools has been developed. However, these tools lack appropriate mechanisms to handle massive time series data sets and large quantities of simultaneous requests, as well as suitable visual representations for annotated data. We propose a distributed, scalable, secure and high-performance architecture that allows a group of researchers to curate a mutual knowledge base deployed over a network and to annotate patterns while preventing data loss from overlapping contributions or unsanctioned changes. Analysts can share annotation projects with peers over a reactive web interface with a customizable workspace. Annotations can express meaning not only over a segment of time but also over a subset of the series that coexist in the same segment. In order to reduce visual clutter and improve readability, we propose a novel visual encoding where annotations are rendered as arcs traced only over the affected curves. The performance of the prototype under different architectural approaches was benchmarked.

Short Papers
Paper Nr: 53
Title:

Assessing Technology Readiness for Artificial Intelligence and Machine Learning based Innovations

Authors:

Tobias Eljasik-Swoboda, Christian Rathgeber and Rainer Hasenauer

Abstract: Every innovation begins with an idea. Turning this idea into a valuable novelty worth investing in requires the identification, assessment and management of innovation projects under two primary aspects: the Market Readiness Level (MRL) measures whether there is actually a market willing to buy the envisioned product, while the Technology Readiness Level (TRL) measures the capability to produce it. The READINESSnavigator is a state-of-the-art software tool that supports innovators and investors in managing these aspects of innovation projects. The existing technology readiness levels neatly model the production of physical goods but fall short in assessing data-based products such as those based on Artificial Intelligence (AI) and Machine Learning (ML). In this paper we describe our extension of the READINESSnavigator with AI- and ML-relevant readiness levels and evaluate its usefulness in the context of 25 different AI projects.

Paper Nr: 61
Title:

A Reference Model for Product Data Profiling in Retail ERP Systems

Authors:

Rolf Krieger and Christian Schorr

Abstract: Due to the high volume of data and the increasing automation in retail, more and more companies are adopting procedures to improve the quality of their product data. A promising approach is the use of machine learning methods that support the user in master data management. The development of such procedures demands error-free training data. This means that product data must be cleaned and labelled, which requires extensive data profiling. For typical retail company databases, which usually have complex and convoluted structures, this exploration step can consume a huge and expensive amount of time. In order to speed up this process, we present a reference model and best practices for the systematic and efficient profiling and exploration of product data.

Paper Nr: 38
Title:

A Conceptual Framework for Data Governance in IoT-enabled Digital IS Ecosystems

Authors:

Avirup Dasgupta, Asif Gill and Farookh Hussain

Abstract: There is a growing interest in the use of the Internet of Things (IoT) in information systems (IS). Data or information governance is a critical component of an IoT-enabled digital IS ecosystem, yet there is insufficient guidance available on how to establish it effectively. The introduction of new privacy regulations such as the General Data Protection Regulation (GDPR), alongside existing regulations such as the Health Insurance Portability and Accountability Act (HIPAA), has added complexity to the issue of data governance, which could hinder effective IoT adoption in the healthcare digital IS ecosystem. This paper enhances the 4I framework, which is iteratively developed and updated using the design science research (DSR) method, to address the pressing need for organizations to have a robust governance model covering the entire data lifecycle in an IoT-enabled digital IS ecosystem. The 4I framework has four major phases: Identify, Insulate, Inspect and Improve. The application of the framework is demonstrated with the help of a healthcare case study. It is anticipated that the proposed framework can help practitioners to identify, insulate, inspect and improve the governance of data in IoT-enabled digital IS ecosystems.

Paper Nr: 58
Title:

Application of Data Stream Processing Technologies in Industry 4.0: What is Missing?

Authors:

Guenter Hesse, Werner Sinzig, Christoph Matthies and Matthias Uflacker

Abstract: Industry 4.0 is becoming more and more important for manufacturers as developments in the area of the Internet of Things advance. Another technology gaining attention is data stream processing systems. Although such streaming frameworks seem to be a natural fit for Industry 4.0 scenarios, their adoption in this context is still low. The contributions of this paper are threefold. Firstly, we present industry findings that we derived from site inspections with a focus on Industry 4.0. Secondly, we elaborate our view on Industry 4.0 and important related aspects. Thirdly, we illustrate why data stream processing technologies could act as an enabler for Industry 4.0 and point out possible obstacles along the way.

Paper Nr: 81
Title:

A Comparative Study for the Selection of Machine Learning Algorithms based on Descriptive Parameters

Authors:

Chettan Kumar, Martin Käppel, Nicolai Schützenmeier, Philipp Eisenhuth and Stefan Jablonski

Abstract: In this paper, we present a new cheat-sheet-based approach to selecting an adequate machine learning algorithm. We extend existing cheat sheet approaches at two ends: we incorporate two different perspectives on the machine learning problem while simultaneously increasing the number of parameters considerably. For each family of machine learning algorithms (e.g. regression, classification, clustering, and association learning) we identify individual parameters that describe the machine learning problem accurately. We arrange those parameters in a table and assess known machine learning algorithms against it. Our cheat sheet is implemented as a web application based on the information in the presented tables.

Area 5 - Databases and Data Security

Short Papers
Paper Nr: 16
Title:

Creating Triggers with Trigger-By-Example in Graph Databases

Authors:

Kornelije Rabuzin and Martina Šestak

Abstract: In recent years, NoSQL graph databases have received increased interest in the research community. Various query languages, such as Cypher and Gremlin, have been developed to enable users to interact with a graph database (e.g. Neo4j). Although the syntax of graph query languages can be learned, inexperienced users may encounter learning difficulties regardless of their domain knowledge or level of expertise. For this reason, the Query-By-Example approach has been used in relational databases over the years. In this paper, we demonstrate how a variation of this approach, Trigger-By-Example, can be used to define triggers in graph databases, specifically Neo4j, as database mechanisms activated upon a given event. The proposed approach follows the Event-Condition-Action model of active databases, which represents the basis of a trigger. To demonstrate the approach, a special graphical interface has been developed which enables users to create triggers in a short series of steps. The approach is tested on several sample scenarios.
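The Event-Condition-Action model underlying such triggers can be sketched as a small dispatcher (illustrative only: the paper generates triggers for Neo4j from graphical examples, whereas the event name, condition, and action below are invented for demonstration):

```python
# Minimal Event-Condition-Action (ECA) dispatcher sketch.
triggers = []

def on(event, condition, action):
    """Register a trigger: (Event, Condition, Action)."""
    triggers.append((event, condition, action))

def fire(event, payload):
    """Run the action of every trigger whose event matches and whose
    condition holds for the payload."""
    for ev, cond, act in triggers:
        if ev == event and cond(payload):
            act(payload)

log = []
# Hypothetical example: after a node insert, flag Person nodes without a name.
on("node_created",
   lambda node: node.get("label") == "Person" and "name" not in node,
   lambda node: log.append("rejected: Person node must have a name"))

fire("node_created", {"label": "Person"})                  # condition holds
fire("node_created", {"label": "Person", "name": "Ada"})   # condition fails
print(log)  # ['rejected: Person node must have a name']
```

A Trigger-By-Example interface essentially lets the user specify the event, condition, and action by drawing example graph patterns instead of writing such code.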

Paper Nr: 43
Title:

Database Performance Comparisons: An Inspection of Fairness

Authors:

Uwe Hohenstein and Martin Jergler

Abstract: Special benchmarks and performance comparisons have been published to analyze and stress the outstanding performance of new database technologies. Quite often, these comparisons show that newly emerging database technologies provide higher performance than traditional relational ones. In this paper, we show that such performance comparisons are not always meaningful and should not encourage one to jump to hasty conclusions. We revisit certain statements about comparisons between the Neo4j graph database and relational systems and indicate a couple of possible reasons for poor measured performance, such as inappropriate or default configurations and overly straightforward implementations. Moreover, we refute some claims about the poor performance of relational systems by using a PostgreSQL database for commonly used test scenarios. We conclude with some considerations of fairness.

Paper Nr: 75
Title:

The Quest for the Appropriate Cyber-threat Intelligence Sharing Platform

Authors:

Thanasis Chantzios, Paris Koloveas, Spiros Skiadopoulos, Nikos Kolokotronis, Christos Tryfonopoulos, Vasiliki-Georgia Bilali and Dimitris Kavallieros

Abstract: Cyber-threat intelligence (CTI) is any information that can help an organization identify, assess, monitor, and respond to cyber-threats. It relates to all cyber components of an organization, such as networks, computers, and other types of information technology. In recent years, due to the major increase in cyber-threats, CTI sharing is becoming increasingly important, both as a subject of research and as a means of providing additional security to organizations. However, selecting the proper tools and platforms for CTI sharing is a challenging task that pertains to a variety of aspects. In this paper, we start by overviewing the CTI procedure (threat types, categories, sources and the general CTI life-cycle). Then, we present a set of seven high-level CTI platform recommendations that can be used to evaluate a platform, and we subsequently survey six state-of-the-art cyber-threat intelligence platforms. Finally, we compare and evaluate the six aforementioned platforms by means of the proposed recommendations.

Paper Nr: 80
Title:

Trading Memory versus Workload Overhead in Graph Pattern Matching on Multiprocessor Systems

Authors:

Alexander Krause, Frank Ebner, Dirk Habich and Wolfgang Lehner

Abstract: Graph pattern matching (GPM) is a core primitive in graph analysis with many applications. Efficient processing of GPM on modern NUMA systems poses several challenges, such as intelligent storage of the graph itself or keeping track of vertex locality information. During query processing, intermediate results need to be communicated, but target partitions are not always directly identifiable, which requires all workers to scan for requested vertices. To address this performance bottleneck, we introduce a Bloom filter based workload reduction approach and discuss the benefits and drawbacks of different implementations. Furthermore, we show the trade-offs between invested memory and performance gain compared to fully redundant storage.
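The workload reduction idea can be sketched briefly: each partition keeps a small Bloom filter over its vertex IDs, so a worker forwards an intermediate result only to partitions whose filter might contain the target vertex, instead of making every worker scan. The filter size, hash construction, and toy partitioning below are illustrative assumptions, not the paper's implementation.

```python
# Bloom filter sketch for routing vertex requests between graph partitions.
import hashlib

M = 128  # filter size in bits (illustrative)

def bit_positions(vertex, k=3):
    """k hash positions for a vertex ID, derived from a SHA-256 digest."""
    digest = hashlib.sha256(str(vertex).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(k)]

def add(filter_bits, vertex):
    for h in bit_positions(vertex):
        filter_bits |= 1 << h
    return filter_bits

def maybe_contains(filter_bits, vertex):
    """False means definitely absent; True means possibly present."""
    return all(filter_bits >> h & 1 for h in bit_positions(vertex))

# Two partitions of a graph, each with its own filter over local vertex IDs.
partitions = {0: [1, 2, 3], 1: [100, 200, 300]}
filters = {p: 0 for p in partitions}
for p, vertices in partitions.items():
    for v in vertices:
        filters[p] = add(filters[p], v)

# Route a request for vertex 200 only to partitions that may hold it.
targets = [p for p in partitions if maybe_contains(filters[p], 200)]
print(targets)  # partition 1 is always included; 0 only on a false positive
```

The memory/workload trade-off the abstract names is visible here: a larger `M` lowers the false positive rate (fewer unnecessary scans) at the cost of more memory per partition.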

Paper Nr: 36
Title:

An Overview of the Endless Battle between Virus Writers and Detectors: How Compilers Can Be Used as an Evasion Technique

Authors:

Michele Ianni, Elio Masciari and Domenico Saccà

Abstract: The increasing complexity of new malware and the constant refinement of detection mechanisms are driving malware writers to rethink the malware development process. In this respect, compilers play a key role and can be used to implement evasion techniques able to defeat even the new generation of detection algorithms. In this paper we provide an overview of the endless battle between malware writers and detectors and we discuss some considerations on the benefits of using high level languages and even exotic compilers (e.g. single instruction compilers) in the process of writing malicious code.