You're reading from NoSQL Data Models Addresses severe issues related to NoSQL data models

Product type Paperback

Published in Aug 2018

Publisher Wiley

ISBN-13 9781786303646

Length 278 pages

Edition 1st Edition

Languages

Cypher

Tools

Cypher Query

Concepts

Database Programming

Author (1):

Olivier Pivert

View More author details

Table of Contents (11) Chapters

Preface

1 NoSQL Languages and Systems

2 Distributed SPARQL Query Processing: a Case Study with Apache Spark FREE CHAPTER

3 Doing Web Data: from Dataset Recommendation to Data Linking

4 Big Data Integration in Cloud Environments: Requirements, Solutions and Challenges

5 Querying RDF Data: a Multigraph-based Approach

6 Fuzzy Preference Queries to NoSQL Graph Databases

7 Relevant Filtering in a Distributed Content-based Publish/Subscribe System

List of Authors

Index

End User License Agreement

1.4. New challenges for database research

Since their appearance at the turn of the year 2000, NoSQL databases have become ubiquitous and collectively store a large amount of data. In contrast with XML databases, which rose in popularity in the mid-1990s to settle on specific, document-centric applications, it seems safe to assume that NoSQL databases are here to stay, alongside relational ones. After a first decade of fruitful research in several directions, it seems that it is now time to unify all these research efforts.

First and foremost, in our sense, a formal model of NoSQL databases and queries is yet to be defined. The model should play the same role that relational algebra played as a foundation for SQL. This model should in particular allow us to:

– describe the data-model precisely;
– express complex queries;
– reason about queries and their semantics. In particular, it should allow us to reason about query equivalence;
– describe the cost model of queries;
– reason about meta-properties of queries (type soundness, security properties, for instance, non-interference, access control or data provenance);
– characterize high-level optimization.

Finding such a model is challenging in many ways. First, it must allow us to model data as it exists in current – and future – NoSQL systems, from the simple key-value store, to the more complex document store, while at the same time retaining compatibility with the relational model. While at first sight the nested relational algebra seems to be an ideal candidate (see, for instance, [ABI 84, FIS 85, PAR 92]), it does not allow us to easily model heterogeneous collections which are common in NoSQL data stores. Perhaps an algebra based on nested data types with extensible records similar to [BEN 13] could be of use. In particular, it has already been used successfully to model collections of (nested) heterogeneous JSON objects.

Second, if a realistic cost model is to be devised, the model might have to make the distributed nature of data explicit. This distribution happens at several levels: first, collections are stored in a distributed fashion, and second, computations may also be performed in a distributed fashion. While process calculi have existed for a long time (for instance, the one introduced by Milner et al. [MIL 92]), they do not seem to tackle the data aspect of the problem at hand.

Another challenge to be overcome is the interaction with high-level programming languages. Indeed, for database-oriented applications (such as Web applications), programmers still favor directly using a query language (such as SQL) with a language API (such as Java’s JDBC) or higher-level abstractions such as Object Relational Mappings, for instance. However, data analytic oriented applications favor idiomatic R or Python code [GRE 15, VAN 17, BES 17]. This leads to inefficient idioms (such as retrieving the bulk of data on the client side to filter it with R or Python code). Defining efficient, truly language-integrated queries remains an unsolved problem. One critical aspect is the server-side evaluation of user-defined functions, written in Python or R, close to the data and in a distributed fashion. Frameworks such as Apache Spark [ZAH 10], which enable data scientists to write efficient idiomatic R or Python code, do not allow us to easily reason about security, provenance or performance (in other words, they lack formal foundations). A first step toward a unifying solution may be the work of Benzaken et al. [BEN 18]. In this work, following the tradition of compiler design, an intermediate representation for queries is formally defined. This representation is an extension of the λ-calculus, or equivalently of a small, pure functional programming language, extended with data operators (e.g. joins and grouping). This intermediate representation is used as a common compilation target for high-level languages (such as Python and R). Intermediate terms are then translated into various back-ends ranging from SQL to MapReduce-based databases. This preliminary work seems to provide a good framework to explore the design space and address the problems mentioned in this conclusion.

Finally, while some progress has been made in implementing high-level operators on top of distributed primitives such as MapReduce, and while all these approaches seem to fit a similar template (in the case of join: prune non-joinable items early and regroup likely candidates while avoiding duplication as much as possible), it seems that some avenues must be explored to unify and formally describe such low-level algorithms, and to express their cost in a way that can be reused by high-level optimizers.

In conclusion, while relational databases started both from a formal foundation and solid implementations, NoSQL databases have developed rapidly as implementation artifacts. This situation highlights its limits and, as such, database and programming language research aims to ‘correct’ it in this respect.