4.2. Technological Foundations 55

Language and its subsequent use within the Social Set Visualizer poses several challenges to the utilized database system. Therefore, a thorough technical evaluation of various options for data storage is required in order to identify a suitable database for the unique analytical workload of Social Set Analysis with large datasets. During the course of this PhD project, six different types of data storage systems which are well suited for an implementation of the Social Set Query Language have been identified and evaluated:

• NoSQL database systems (NoSQL) such as MongoDB, a performant, cutting-edge approach to databases with a dynamic data schema based on the notion of documents with a flexible number of attributes, less strict compliance with atomicity, consistency, isolation and durability (ACID) than RDBMSs, and mostly custom query languages.

• Relational database management systems (RDBMS) such as PostgreSQL, a solid, well-established type of database which provides strict compliance with ACID requirements on top of features such as a fixed data schema, stored procedures, transactions, and support of the Structured Query Language (SQL).

• Distributed database systems such as Apache Spark, a highly scalable, cutting-edge, distributed database with immense storage capacity, support of parallel analytical data processing, and a cluster-based scaling approach.

• Key-value database systems such as Redis, a stable, in-memory database with native support for set intersections and a high level of performance.

• Graph databases such as Neo4J, a modern, enterprise-ready graph database implemented in Java with a custom SQL-style textual query language for graphs called Cypher paired with strict ACID compliance.

• Implementation of a custom database system using the Go programming language which is tailor-made for the unique data schema of the Social Interaction Model and optimized for the OLAP-style workload of the Social Set Query Language.

The creation of an interactive Visual Analytics software tool places unique requirements on runtime optimization and on the cancellation of long-running queries. Therefore, a prototype data storage system supporting the Social Set Query Language has been implemented and evaluated for each of the six presented approaches.
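The cancellation requirement can be illustrated with a minimal sketch. It is not taken from any of the evaluated prototypes: the function name `intersect_with_cancel`, the `check_every` parameter, and the use of a `threading.Event` as cancellation signal are assumptions made purely for illustration. The idea is that a long-running set computation periodically checks a flag which the dashboard could set when the user abandons a query.

```python
import threading

def intersect_with_cancel(sets, cancel, check_every=1000):
    """Intersect sets of actor IDs cooperatively.

    Returns the intersection, or None if `cancel` was set while running.
    All names here are hypothetical and only serve the illustration.
    """
    if not sets:
        return set()
    # Iterate over the smallest set so that the loop is as short as possible.
    ordered = sorted(sets, key=len)
    smallest, others = ordered[0], ordered[1:]
    result = set()
    for i, item in enumerate(smallest):
        # Poll the cancellation flag every `check_every` elements.
        if i % check_every == 0 and cancel.is_set():
            return None
        if all(item in other for other in others):
            result.add(item)
    return result

cancel = threading.Event()
a = set(range(0, 10_000))
b = set(range(5_000, 15_000))

full = intersect_with_cancel([a, b], cancel)     # runs to completion
cancel.set()                                     # e.g. the user closed the view
aborted = intersect_with_cancel([a, b], cancel)  # stops at the first check
```

In a real deployment the flag would typically be set from another thread or request handler; the sketch sets it up front only to keep the example deterministic.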

NoSQL Database: MongoDB

MongoDB has been evaluated as a storage solution for the Social Set Query Language based on its popularity as a NoSQL database. Due to its loose data schema definitions, Big Social Data can be stored in the same format as received from remote APIs.

While the flexible data schema of MongoDB massively simplifies the import process of historic Big Social Data from software tools such as SOGATO and SODATO, a NoSQL storage is not suitable for a strict implementation of a data schema according to the Social Interaction Model. When social media data from different sources and collection timeframes is merged, a common data structure is vital. In a NoSQL database, it is hard to follow a strict schema definition, as any kind of input can be stored in any collection. The evaluation of NoSQL data storage for the initial versions of the Social Set Visualizer as published in Publication II [Flesch et al. 2015a] has underlined this issue. On top of that, the tooling for MongoDB does not provide sufficient means for runtime optimization of database queries. Therefore, the initial version of the Social Set Visualizer was subsequently implemented based on a relational data schema which supports SQL. SQL lends itself to a highly dynamic way of building database queries and provides many tools for performance improvements such as query optimization, table indexes, and stored procedures. MongoDB only provides custom querying functions which can be used through function calls in its Node.js API. Given the difficulty of keeping Big Social Data in a fixed structure and the relaxed ACID compliance, NoSQL databases in general and MongoDB in particular do not represent a suitable option for data storage.
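The need for a common data structure when merging sources can be sketched as follows. The record shapes, the field aliases, and the `Interaction` type are hypothetical and are not the actual export formats of SODATO or SOGATO, nor the formal definition of the Social Interaction Model. The sketch only shows that schema-free records must pass an explicit normalization step before they can be queried uniformly, which is the step a strict schema enforces at import time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interaction:
    """Hypothetical common structure for one social media interaction."""
    actor_id: str
    artifact_id: str
    action: str

# Hypothetical aliases under which different sources might name each field.
FIELD_ALIASES = {
    "actor_id": ("actor_id", "user", "from_id"),
    "artifact_id": ("artifact_id", "post", "object_id"),
    "action": ("action", "type"),
}

def normalize(raw: dict) -> Interaction:
    """Map a raw record onto the common structure; reject incomplete records."""
    values = {}
    for field, candidates in FIELD_ALIASES.items():
        for key in candidates:
            if key in raw:
                values[field] = str(raw[key])
                break
        else:
            # A document store would accept this record silently.
            raise ValueError(f"record lacks field {field!r}: {raw}")
    return Interaction(**values)

# Two records with different shapes, as they might arrive from two sources.
source_a = {"user": 17, "post": "p1", "type": "like"}
source_b = {"from_id": "17", "object_id": "p1", "action": "like"}

merged = {normalize(source_a), normalize(source_b)}
```

Both records collapse into a single `Interaction` once normalized, whereas stored as raw documents they would remain two structurally unrelated entries.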

Relational Database: PostgreSQL

A relational database implementation of the Social Set Query Language was evaluated using the PostgreSQL database. PostgreSQL provides a strict database schema, which is defined in line with the theoretical model of Big Social Data, the Social Interaction Model. Furthermore, it offers mature tooling for query optimization purposes, including table indexes, query plan optimization, and various configuration options for the PostgreSQL database service. During the work on Publication II [Flesch et al. 2015a], it became apparent that the data import pipeline is more complex in a PostgreSQL scenario. Once the Big Social Data is loaded into the database system, however, considerable flexibility is gained through the availability of SQL and optimization features. Therefore, PostgreSQL was chosen as the primary database backend for the Social Set Query Language, and it remains so up to the latest version 3 of the Social Set Visualizer presented in this dissertation.
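A strict schema combined with an index and a set-style SQL query might look like the following sketch. It uses Python's built-in `sqlite3` module as a stand-in so that the example runs without a database server; it is not the PostgreSQL schema of the Social Set Visualizer. The table name `interaction`, its columns, the permitted action values, and the helper `actors_in_both` are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interaction (
        actor_id    TEXT NOT NULL,
        artifact_id TEXT NOT NULL,
        action      TEXT NOT NULL
                    CHECK (action IN ('like', 'comment', 'share'))
    );
    -- Index supporting lookups of all actors for a given artifact.
    CREATE INDEX idx_interaction_artifact
        ON interaction (artifact_id, actor_id);
""")

rows = [
    ("alice", "post_1", "like"),
    ("bob",   "post_1", "comment"),
    ("alice", "post_2", "share"),
    ("carol", "post_2", "like"),
]
conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)", rows)

def actors_in_both(conn, first, second):
    """Return the actors who interacted with both artifacts."""
    query = """
        SELECT actor_id FROM interaction WHERE artifact_id = ?
        INTERSECT
        SELECT actor_id FROM interaction WHERE artifact_id = ?
    """
    return {row[0] for row in conn.execute(query, (first, second))}

overlap = actors_in_both(conn, "post_1", "post_2")
```

Unlike a schema-free store, the database rejects a row whose `action` value falls outside the declared set, so malformed records are caught during import rather than at query time.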

Distributed Database: Apache Spark

After the implementation of the Social Set Query Language in Publication II [Flesch et al. 2015a] based on a PostgreSQL system, a thorough evaluation of potential performance improvements was performed. Particular attention was paid to scalable, distributed storage solutions such as Apache Spark for potential parallelization of analytical workloads. Because PostgreSQL on its own delivers good but not outstanding performance for data storage, we further attempted to measure empirically, using our test data, whether other storage types improve execution time.

The evaluation of the Social Set Query Language in Apache Spark resulted in mixed findings. On the one hand, the file-first approach of working with individual files in a distributed storage system promises great potential when working with multi-gigabyte exports of Big Social Data such as social media data from Facebook. On the other hand, the parallelization of analytical queries exhibited a significant delay between the execution of a query and the return of its results. Due to the strict performance requirements of an interactive Visual Analytics dashboard such as the Social Set Visualizer, the database system on which the Social Set Query Language is implemented needs to return results very quickly.

The evaluation leads to the conclusion that Apache Spark is not able to achieve better performance in terms of execution time, due to the long preparation time for parallel queries. Although the challenge of distributed storage of Big Social Data is well resolved by Apache Spark, its low performance in terms of query execution time cannot be outweighed by the mentioned benefits. Therefore, the utilization of the PostgreSQL RDBMS for our purpose should not be discontinued in favor of Apache Spark.

Key-Value Database: Redis

In Social Set Visualizer 2, the Social Set Query Language was implemented in Redis, a key-value database, and evaluated as the main database for set computations. This evaluation was described in Publication III [Flesch et al. 2016]. Redis provides native set data types which can be used with set-theoretical operations such as union and intersection. All set intersection calculations have been moved from PostgreSQL to Redis. A dedicated Redis instance now performs the memory-intensive set intersection calculations with significantly better execution speed and pipes the calculation results back to the user-facing dashboard in real time.
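The kind of set computation delegated to Redis can be sketched without a running server. The class `SetStore` below is a hypothetical in-memory stand-in, not the Redis client API; its two methods merely mirror the semantics of the Redis commands SADD (add members to the set stored at a key) and SINTER (intersect the sets stored at several keys). The key names are likewise invented for illustration.

```python
from collections import defaultdict

class SetStore:
    """Hypothetical in-memory stand-in mirroring Redis SADD and SINTER."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key, *members):
        """Add members to the set at `key`; return how many were new."""
        before = len(self._sets[key])
        self._sets[key].update(members)
        return len(self._sets[key]) - before

    def sinter(self, *keys):
        """Return the members present in every one of the given sets.

        A missing key is treated as an empty set. Returning an empty set
        when no keys are given is a choice made for this sketch.
        """
        if not keys:
            return set()
        sets = [self._sets.get(key, set()) for key in keys]
        return set.intersection(*sets)

store = SetStore()
# One set of actor IDs per (hypothetical) Facebook wall.
added = store.sadd("actors:wall_a", "alice", "bob", "carol")
store.sadd("actors:wall_b", "bob", "dave")
store.sadd("actors:wall_c", "bob", "carol")

shared_ab = store.sinter("actors:wall_a", "actors:wall_b")
shared_all = store.sinter("actors:wall_a", "actors:wall_b", "actors:wall_c")
```

Because every set lives entirely in working memory, such intersections are fast, but the total size of all sets is bounded by the memory of the machine.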

Even though the computational performance of set calculations within Redis is outstanding, the severe limitations in terms of working memory and, by proxy, of the research funds available for extra-high-memory systems prevent a full focus on Redis as a data storage solution. Any implementation of the Social Set Query Language with Redis as the sole data storage backend will be severely constrained by the available working memory. Therefore, it is not financially feasible to implement all of the Social Set Query Language within Redis. Hence, in practice the use of Redis is always paired with a relational database such as PostgreSQL, which serves as a backup system to prevent loss of data.

Graph Database: Neo4J

Neo4J is a modern graph-based database system. It was evaluated for the storage of Big Social Data according to the Social Interaction Model and for its feasibility during the evaluation of the third version of the Social Set Visualizer. It became apparent that the transformation of Big Social Data into a graph representation is a challenging undertaking. Even though Big Social Data structured according to the Social Interaction Model exhibits several graph-like characteristics, the hierarchical overall structure could not be successfully stored in the graph-based database. The full potential of a graph-based database such as Neo4J could not be unleashed, as

4.3. Software Architecture 59