ADITCA 2019 Abstracts


Full Papers
Paper Nr: 1
Title:

Pipelined Implementation of a Parallel Streaming Method for Time Series Correlation Discovery on Sliding Windows

Authors:

Boyan Kolev, Reza Akbarinia, Ricardo Jimenez-Peris, Oleksandra Levchenko, Florent Masseglia, Marta Patino and Patrick Valduriez

Abstract: This paper addresses the problem of continuously finding highly correlated pairs of time series over the most recent time window. The solution builds upon the ParCorr parallel method for online correlation discovery and is designed to run continuously on top of the UPM-CEP data streaming engine through efficient streaming operators. The implementation takes advantage of the streaming engine's flexible API, which provides low-level primitives for developing custom operators. Thus, each operator is implemented to process incoming tuples on-the-fly and emit resulting tuples as early as possible. This guarantees a truly pipelined flow of data that allows early results to be output, as the experimental evaluation shows.
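The abstract itself contains no code; as a rough illustration of the kind of incremental, per-tuple computation such streaming operators perform, the sketch below maintains the Pearson correlation of two streams over the most recent window. The class name, window mechanics, and running-sums approach are illustrative assumptions, not the actual ParCorr or UPM-CEP implementation.

```python
from collections import deque
from math import sqrt

class SlidingCorrelation:
    """Incrementally maintains the Pearson correlation of two streams
    over the most recent `window` values (a hypothetical operator sketch)."""

    def __init__(self, window):
        self.window = window
        self.xs, self.ys = deque(), deque()
        # Running sums let each update run in O(1), so results can be
        # emitted as soon as a new pair of values arrives.
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        # Add the new pair and evict the oldest one once the window is full.
        self.xs.append(x); self.ys.append(y)
        self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y; self.sxy += x * y
        if len(self.xs) > self.window:
            ox, oy = self.xs.popleft(), self.ys.popleft()
            self.sx -= ox; self.sy -= oy
            self.sxx -= ox * ox; self.syy -= oy * oy; self.sxy -= ox * oy
        return self.correlation()

    def correlation(self):
        n = len(self.xs)
        if n < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / n
        vx = self.sxx - self.sx * self.sx / n
        vy = self.syy - self.sy * self.sy / n
        if vx <= 0 or vy <= 0:
            return 0.0
        return cov / sqrt(vx * vy)
```

Because each update touches only the arriving and evicted values, an operator built this way never needs to rescan the window, which is what makes early, pipelined emission of results possible.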

Paper Nr: 2
Title:

GPU Acceleration of PySpark using RAPIDS AI

Authors:

Abdallah Aguerzame, Benoit Pelletier and François Waeselynck

Abstract: RAPIDS AI is a promising open-source project for accelerating end-to-end Python data science workloads. Our goal is to integrate RAPIDS AI capabilities within PySpark and offload compute-intensive PySpark tasks to GPUs to gain performance.

Paper Nr: 4
Title:

Using a Skip-gram Architecture for Model Contextualization in CARS

Authors:

Dimitris Poulopoulos and Athina Kalampogia

Abstract: In this paper, we describe, at a high level, how a major retailer’s recommender system contextualizes the information that is passed to it in order to provide real-time in-store recommendations. We focus specifically on the data pre-processing ideas that were necessary for the model to learn. The paper describes the ideas and reasoning behind crucial data transformations, and then illustrates a learning model inspired by work in Natural Language Processing.

Paper Nr: 5
Title:

Recovery in CloudDBAppliance’s High-availability Middleware

Authors:

Hugo Abreu, Luis Ferreira, Fábio Coelho, Ana N. Alonso and José Pereira

Abstract: In the context of the CloudDBAppliance (CDBA) project, fault tolerance and high availability are provided in layers: within each appliance, within a data centre and between data centres. This paper presents the recovery mechanisms in place to fulfill the provision of high availability within a data centre. The recovery mechanism takes advantage of CDBA’s in-middleware replication mechanism to bring failed replicas up-to-date. Along with the description of different variants of the recovery mechanism, this paper provides their comparative evaluation, focusing on the time it takes to recover a failed replica and how the recovery process impacts throughput.

Paper Nr: 6
Title:

Fast and Streaming Analytics in ATM Cash Management

Authors:

Terpsichori-Helen Velivassaki and Panagiotis Athanasoulis

Abstract: Cash management across a network of ATMs can be greatly improved by exploiting a multitude of information sources, which, however, generate huge datasets that cannot be analysed with traditional methods. Business Intelligence in the Banking, Financial Services and Insurance (BFSI) sector is even more challenging when exploitation of real-time information streams is desired. This paper presents the uCash ATM cash management system, running on top of CloudDBAppliance, supporting fast analytics over historical data as well as streaming analytics over real-time data, served by its Operational Database. The paper discusses integration points with the platform and provides integration hints, thus constituting a real use-case example of the CloudDBAppliance platform and allowing for its further exploitation in various application domains.

Paper Nr: 7
Title:

Implementing Value-at-Risk and Expected Shortfall for Real Time Risk Monitoring

Authors:

Petra Ristau

Abstract: Regulatory standards require financial service providers and banks to calculate certain risk figures, such as Value at Risk (VaR) and Expected Shortfall (ES). Properly calculated, these figures rely on Monte-Carlo simulation, which is computationally expensive. This paper describes the architecture and development considerations of a use case building a demonstrator for a big data analytics cloud platform developed in the CloudDBAppliance (CDBA) project. The chosen approach allows for real-time risk monitoring using cloud computing together with a fast analytical processing platform and database.
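The abstract does not specify the simulation model; as a minimal sketch of what Monte-Carlo VaR and ES estimation looks like, the function below simulates normally distributed returns and reads the two risk figures off the loss distribution. The function name, parameters, and normal-returns assumption are illustrative, not the demonstrator's actual model.

```python
import random
import statistics

def var_es_monte_carlo(mu, sigma, position_value, n_sims=100_000,
                       confidence=0.99, seed=42):
    """Estimates Value at Risk (VaR) and Expected Shortfall (ES) for a
    single position via Monte-Carlo simulation of normally distributed
    returns. Losses are reported as positive numbers.

    A hypothetical sketch: real risk engines simulate full portfolios
    with correlated risk factors, which is what makes the computation
    expensive enough to need a fast analytical platform.
    """
    rng = random.Random(seed)
    # Simulate losses: a negative return on the position is a positive loss.
    losses = sorted(-rng.gauss(mu, sigma) * position_value
                    for _ in range(n_sims))
    # VaR is the loss quantile at the confidence level; ES averages the
    # tail losses beyond VaR, so ES is always at least as large as VaR.
    idx = int(confidence * n_sims)
    var = losses[idx]
    es = statistics.fmean(losses[idx:])
    return var, es
```

For example, with daily volatility of 2% on a 1,000,000 position, the 99% one-day VaR comes out near 2.33 standard deviations (roughly 46,500), with ES somewhat above it.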

Paper Nr: 8
Title:

Parallel Efficient Data Loading

Authors:

Ricardo Jiménez-Peris, Francisco Ballesteros, Ainhoa Azqueta, Pavlos Kranas, Diego Burgos and Patricio Martínez

Abstract: In this paper we discuss how we architected and developed a parallel data loader for the LeanXcale database. The loader is characterized by its efficiency and parallelism. LeanXcale can scale up and scale out to very large deployments, and loading data in the traditional way does not exploit its full potential in terms of the loading rate it can reach. For this reason, we have created a parallel loader that can reach the maximum insertion rate LeanXcale can handle. LeanXcale also exhibits a dual interface, key-value and SQL, which the parallel loader exploits. Basically, the loading leverages the key-value API, resulting in a highly efficient process that avoids the overhead of SQL processing. Finally, to guarantee parallelism, we have developed a data sampler that generates a histogram of the data distribution and uses it to pre-split the regions across LeanXcale instances, so that all instances receive an even amount of data during loading, thus guaranteeing the peak loading capability of the deployment.
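The pre-splitting idea in this abstract can be sketched in a few lines: take quantiles of a key sample as region boundaries, so each region receives roughly the same share of the data. The function names and the quantile choice are illustrative assumptions, not LeanXcale's actual sampler.

```python
import bisect

def pre_split_boundaries(sample_keys, num_partitions):
    """Derives region split points from a sample of the keys to be
    loaded, so that each of `num_partitions` regions receives roughly
    the same amount of data (a sketch of histogram-based pre-splitting)."""
    keys = sorted(sample_keys)
    # Evenly spaced quantiles of the sample serve as split boundaries;
    # num_partitions regions need num_partitions - 1 boundaries.
    return [keys[(i * len(keys)) // num_partitions]
            for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    """Routes a key to its region via binary search over the boundaries."""
    return bisect.bisect_right(boundaries, key)
```

Loader workers can then route each record with `partition_of` and issue key-value inserts in parallel, keeping every instance evenly loaded.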

Paper Nr: 9
Title:

Dynamic Data Streaming for an Appliance

Authors:

Marta Patiño and Ainhoa Azqueta

Abstract: Many applications require analysing large amounts of continuous data flows produced by different data sources before the data is stored. Data streaming engines emerged as a solution for processing data on the fly. At the same time, computer architectures have evolved into systems with several interconnected CPUs and Non-Uniform Memory Access (NUMA), where the cost of accessing memory from a core depends on how the CPUs are interconnected. In order to achieve better resource utilization and adaptiveness to the load, dynamic migration of queries must be available in data streaming engines. Moreover, data streaming applications require high availability, so that failures do not cause service interruption or data loss. This paper presents the dynamic migration and fault-tolerance capabilities of UPM-CEP, a data streaming engine designed to take advantage of NUMA architectures. A preliminary evaluation using the Intel HiBench benchmark shows the effect of query migration and fault tolerance on system performance.