Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Translating a Distributed Relational Database to a Document Database

Translating a Distributed Relational Database to a Document Database The problem of transferring data from a database of one type to a database of another type is relevant today due to the increase in data volumes and the complexity of tasks solved in various fields of human activity. The existing databases are less and less satisfying the needs of users. New types of databases appear that are more suitable for solving large-scale problems. This article proposes an algorithm for solving the problem of optimizing the document structure of a database based on metadata about the structure of a distributed relational database from which data are transferred. The approach also takes into account information about the structure of the priority database queries. The priority of database queries is user-defined. The system of automatic translation of the database, taking into account these metadata, allows the user to create a distributed document database that is optimal in two parameters: in terms of the volume of stored data and in terms of the execution time of priority database queries. Keywords NoSQL · Database query · Collection · Document · Database structure optimization · Embedded documents 1 Introduction the complexity will increase by an order of magnitude, and the characteristics that make relational databases so attrac- Over the past decades, the variety of database applications tive rapidly reduce to zero the chances of using them as a has increased, and with them the importance of database platform for large distributed systems. features such as simplicity, robustness, flexibility, perfor - The solution to this problem is to transfer systems to other mance, scalability, and compatibility has increased. And types of databases that have a higher scalability, but with the with the increase in the number of databases, one feature loss of other capabilities available in relational databases. becomes more important than all the others. It is scalability. It must be said that this solution has a number of difficul - As more and more applications run under high load condi- ties: the choice of a database type suitable for an applied tions, their scalability requirements can change and grow task, in the case of a document database or a database of the very quickly. Relational databases scale well only if they are column family type; choice of a global database structure in located on a single server. When the resources of this server an existing distributed system and a local structure of objects run out and it becomes necessary to increase the number of in a database, transferring data from an existing database to computational nodes and distribute the load between them, a new type of database. then problems arise with the growth of the complexity of The problem is that document database management sys- relational databases, which directly depends on the amount tems do not perform " join" operations well, so it is impossi- of scalability. So, if there is an increase in the number of ble to translate a distributed relational database into a docu- servers not to a few, but to a hundred or a thousand, then ment database format on a one-to-one principle. The use of this principle will lead to the opposite effect: a decrease of performance due to complication of queries and their partial * Yulia Shichkina implementation in client applications (those parts of queries strange.y@mail.ru that cannot be implemented by a document DBMS, e.g., the Muon Ha "join" operation). Currently, there is no formalized way to muon.ha@mail.ru build an optimal distributed document database structure based on relational database metadata for translating data Faculty of Computer Science and Technology, Saint from one format to another. In this paper we propose a solu- Petersburg Electrotechnical University “LETI”, St. Petersburg, Russia tion to this problem. Vol:.(1234567890) 1 3 Translating a Distributed Relational Database to a Document Database 137 In earlier articles, we considered the problems associated specific subject area. The third section provides a methodology with the optimization of the structure of non-distributed doc- for applying methods of optimizing the structure of document ument databases and databases such as a family of columns, databases, taking into account the structure of planned queries taking into account the structure of the planned queries [1]. for the case of a distributed database. The fourth section pre- We also described the methodology for applying methods sents the results of testing the proposed methodology on data- for optimizing the structure of a database, depending on the bases of various sizes with queries of various structures and stage at which this optimization is carried out: at the initial degrees of their embeddedness. In the conclusion, the main stage of creating a new database, at the stage of translating results of the research are summarized. data from relational databases to NOSQL databases, or when consolidating databases of various types. But, all these solu- tions concerned only centralized databases.2 Overview of Related Works Currently, due to the large increase in data volumes, dis- tributed databases are increasingly used. Therefore, in this The problem of translating relational databases into NoSQL article, we show how methods designed for centralized data- is relevant and solved by many researchers. The main dif- bases would work when used to translate distributed rela- ficulty in solving this problem is the difference in data struc- tional databases to distributed NOSQL databases. According tures in relational databases (RDBs) and NoSQL. to our methodology described in [xxx], these methods can The paper [4] offers a solution for quick data migration also be used when creating new document databases. from a relational database into a document-oriented data- The proposed methods are based on graph theory, set base. Authors have created semi-automatically two logical theory, and parallel computing theory: levels over physical data. Users can refine generated logical data model and configure data migration template for each – Elements of set theory allow to formalize the process of needed document. Data migration features are implemented choosing the optimal structure of documents in collec- into relational database browser Dig Browser. But, the docu- tions of the MongoDB database, taking into account the ment does not present solutions to the problems of optimiz- architecture of the computing system, the set of elements ing the database schema and database queries. stored in the database, and queries to the database that The paper [5] describes some approaches of migration are most often executed or require the highest execution and proposes an approach of model transformation from speed. This is the first necessary step to achieve an opti- object relational database to NoSQL document database mal balance between the amount of stored data and the (MongoDB). speed of execution of queries to the database. In article [6], the authors describe a semi-automated – Elements of graph theory allow visualizing and formal- approach to migrate highly dynamic SQL-based web appli- izing the query execution plan depending on the architec- cations to ones that use document-oriented NoSQL data- ture of the computing system and the database schema. bases such as MongoDB using source analysis and transfor- – Elements of the theory of parallel computing allow to mation techniques. Authors demonstrate our semi-automated formalize the process of optimizing the query execution framework on the analysis and migration of three existing plan. We use the theory of parallel computing based on web applications to extract, classify, translate, migrate, and information graphs [2, 3]. optimize the queries and migrate the PHP code to interact with the migrated database. Authors approach provides Thus, our proposed approach to translating a distributed migrating and optimizing the embedded SQL queries to relational database into a distributed document database is interact with the new database system and changing the based on sequential steps: optimizing the document database application code to use the translated queries. But, this schema based on metadata about the relational database, build- approach does not take into account distributed database ing query execution plans, optimizing query execution plans, schemas and query optimization based on this schema. translating queries from SQL format to MongoDB format. Approaches to the translation of databases from the RDB The article is structured as follows. The second section pro- format to the format of documentary NoSQL databases are vides an analysis of existing studies close to the topic of our described in various publications [7–12]. Articles [7, 8] research. This section shows that despite a huge number of describe automatic solutions in which the input data are publications on the topic of relational databases and NOSQL RDB metadata and ER diagrams. In [8], an algorithm is and their mutual transfer from one format to another, there is presented that takes into account the dependencies between no research devoted to formalized methods of data transla- tables and the number of primary keys and foreign keys in tion in distributed databases. There are separate publications the conversion process. In [9], an algorithm is presented that in which the authors talk about their experience of translating uses dependencies between tables, but data embeddedness data in distributed databases to solve problems in a narrow is used only in relation to a foreign key to a primary key. In 1 3 138 M. Ha, Y. Shichkina article [7], tables are classified according to four types: main, interact with any software supporting JDBC. A virtual- subclass, related, and general. The algorithm presented by ization system is built on top of the NoSQL sources that the authors uses this classification to include subclass and translates SQL queries into the source-specific APIs. The general tables in the main table. Table of relationships is virtualization architecture allows users to query and join data converted by using the links. This research mainly focuses from both NoSQL and relational SQL systems in a single on transforming a relational concept (e.g., table, relation- SQL query. ships, etc.) into a NoSQL (document database) concept In [16], it is proposed to analyze log files to determine (document, subdocuments, etc.). The research proposed which tables are frequently involved in a query with the by the authors [10] has a different classification, consisting “join” operator. The authors suggest placing these tables on of four classes: codifier, simple entity, complex entity, and the same node, if possible. This makes it possible not to use N:N-relationship. In addition to classification, the user must the "join" operator. provide a “focused table” that represents the target NoSQL In general, the task of translating queries between the entity. The algorithm creates a tree with related tables. In types of RDB and NoSQL databases is not a solved prob- [11], the authors propose the use of tags to control the trans- lem when translating databases. Analysis of publications lation of databases. The user tags the ER diagram with tags shows that the following directions of its solution can be that represent the characteristics of the data and queries. distinguished: Based on these tags, the algorithm decides to use embedded documents or links in the conversion. - Development of middleware software that executes SQL The authors [12] used formal concept analysis, con- commands to process data for NoSQL databases. This ceptual scaling, and relational concept analysis to create a approach was proposed in [17–19]. conceptual model that helps users define the structures of - Translation of queries from SQL to NoSQL format. For a NoSQL database. This model provides translation rules example, [20] is a web translator that takes a SQL query for all types of relationships between relational data tables and generates an equivalent MongoDB query. The trans- (1–1, 1-n, n–n). lation is based solely on the syntax of the SQL query. The In connection with the increase in the volume of data, approach does not take into account any data or schemas. there has recently been an increased attention to the research There is no explanation for the approach to translation. of translation processes and optimization of queries when Russell [21] describes a library that provides an API for translating data from one format to another. translating SQL queries to MongoDB. The translation The document [13] proposes an application with a graphi- is based on SQL query syntax only. The system from cal user interface for data transfer and automatic query con- [22] requires the user to provide a MongoDB schema version. The system that authors propose can be subdivided expressed in relational form using tables, procedures, into an online database application and a query conversion and functions. utility. The client application acts as an environment that allows the users to select and convert the databases from Among the studies directly related to the optimization of SQL to MongoDB. The article also discusses the structure, queries to relational databases, it is possible to single out data types, and keys. The query conversion utility provides [23]. This document describes the methods used to optimize the user with graphical user interface that allows him to relational queries in an SDD-1 distributed database system. choose from some basic predefined SQL queries or write his Queries are submitted to SDD-1 in a high-level procedural own SQL query. This way the user can implement his que- language called Datalanguage. ries even if he is not familiar with the MongoDB database. The following papers describe solutions for query The article [14] is devoted to the problems of transla- performance: tion of queries from a distributed relational database into The article [24] describes the Perfopticon system, which NOSQL. This article discusses column-oriented NoSQL is an interactive query profiling tool. This tool allows you DBMS, HBase, because it is widely used by many Inter- to quickly identify performance bottlenecks and data skew. net enterprises such as Facebook, Twitter, and LinkedIn. Perfopticon combines interactive visualizations of (1) query Because HBase does not support SQL, authors use Apache plans, (2) overall query execution, (3) data flow among serv - Phoenix as an SQL layer on top of HBase. The authors also ers, and (4) execution traces. These views coordinate mul- confirm that important unsolved problems are supporting tiple levels of abstraction to enable detection, isolation, and complex SQL queries, automatic index selection, and opti- understanding of performance issues. mizing SQL queries for NoSQL. The article [25] provides an overview of query optimiza- In article [15], authors present a generic standards-based tion systems. In particular, it was considered that AQUA architecture that allows NoSQL systems, with specific focus [26] is a query optimizer which aims at reducing the cost on MongoDB, to be queried using SQL and seamlessly of storing intermediate results by optimizing the order of 1 3 Translating a Distributed Relational Database to a Document Database 139 join operations. QMapper [27] is an intelligent translator 3 Methods on Which Distributed Database to enable automatic rule-based SQL-to-HiveQL mapping Translations are Based as well as cost-based optimizing of the translated queries. There are also a number of articles devoted to translat- The method for optimizing the schema of document ing databases into other types of NoSQL databases. Among databases taking into account the structure of executed them are the following. queries for the case of a distributed database is based on The approach described in [28] is for creating data struc- similar methods for centralized databases described in tures for column-oriented databases such as HBASE. The [32] and methods for optimizing the graph of information authors propose to analyze the primary key and foreign key dependencies. of relational tables and then create a large table that stores The essence of the methods for optimizing the structure all related information. The main focus of this approach is of document databases, taking into account the structure of to create a key for a large table. The authors [11] also formu- executed queries for centralized databases, consists in the lated some rules for converting a table with one embedded following postulates: and several embedded DBMS tables to a column-oriented HBase database. 1. It is necessary to present a relational database in the The easiest way to translate a relational database into a form of a single set of fields, and all queries are also in graph database is to simply convert each table record to a the form of a single set of fields. Then, by iteratively node, grouping the set of nodes by label name. Foreign keys applying set subtraction and set intersection operations, between tables are converted into graph edges connecting it is possible to obtain collections of fields that will cor - two nodes [29]. The authors of [30] proposed the transfor- respond to the collections in the document database. mation of the 3NF-form of the RDB into the form of the 2. It is necessary to take into account during these opera- 3EG graph model of the database. The authors used a rela- tions the presence of types of links between tables of tional database that exists in 3NF as input data and proposed the relational database. Then, based on the collections four rules for transforming relational database tables into a obtained from the results of the first postulate, it is pos- graph database. sible to optimize the structure of the document database The authors of [31] also proposed an algorithm for con- by creating embedded documents. verting a relational database into a graph database. They combine key values from more than one row in a single These two methods of optimizing the structure of a docu- node, so that when a user needs relevant information, he can ment database without and with embedded documents are get it by visiting only one node. They define some rules for described in detail in [2]. different cases of grouping information from different rows The aim of the current research is to find approaches to of a table in one node. They created a description of the full account for the structure of a distributed database for accel- path along the graph, which allows to go from the source erated query execution. node to the destination node, and find information from a A database query is a set of instructions that describe the given node, which must be duplicated in another node. order of actions of the executor to achieve a certain result. Our analysis shows that the problem of translating data After the development of parallelism in the operation of from one format to another is currently a very actual prob- computers, the word "sequence" began to be replaced by lem. However, a good solution to this problem has not yet the more general word "order." It is possible to say that a been found. Many researchers are looking for a solution. query to a database is a determination of the order and con- There are specific solutions for individual areas, for direct ditions for selecting data, and the process of creating a query translation of tables into collections, for transferring RDBs is a decomposition of a query into elementary subqueries. to document databases, taking into account the types of rela- An elementary subquery can be understood as a construct tionships between tables, for extracting data from document «Select» without embedded subqueries. If a query contains databases using SQL queries. However, the researchers have an elementary query (subquery) inside it, then it is called the not yet proposed a general approach that would take into parent, and the subquery is called the child. account the database schema, the computing system archi- An information graph is further understood as a directed tecture, and the query structure. Our research is devoted to acyclic multigraph, the vertices of which correspond to finding such a general approach. elementary queries within the main query, and the edges This article proposes a formalized approach that takes correspond to data transfer between them. into account RDB schemas, SQL query schemas, and shard- The principle of efficient organization of parallel query ing and replication schemes to create a distributed document processing implies the search for subqueries that are inde- database. pendent of each other according to the data. When creating 1 3 140 M. Ha, Y. Shichkina a new query model, it is important that query properties 4. Adaptation of the graph for horizontal sharding for the such as computational accuracy and query robustness are case when the table "T" of the RDB is distributed into k preserved. For this, it is important to know which subqueries shards: have input data that result from the execution of other sub- a. Create k input vertices of the graph, each of which queries. These links are reflected in the information graph. will correspond to a certain part of the table on a An important role in this is played by the concept of a paral- certain shard: T , i = 1…k. lel graph form. b. If table T was connected by an edge with query Q, When modifying the parallel form of the information then create k graph vertices, each of which will cor- graph, the structure of the graph does not change, but only respond to query Q executed for a certain part of the its projection on the plane. Due to this, a new query created table on a certain shard: Q , i = 1…k. from the modified parallel form of the information graph is equivalent to the original query. But, with this approach, Further, one of three options is possible: a global extremum may not be achieved in optimizing the Option 1. solution to the problem in terms of certain parameters, for example, in time. This approach will find the most optimal i. Connect all the vertices that were previously con- query among all equivalent queries, taking into account data nected to the vertex Q with the vertices Q , i = 1…k; relationships between subqueries. ii. Connect each vertex corresponding to a part of the To construct a parallel plan of queries in the first approxi - table T with a vertex corresponding to the query Q mation, it is possible to use the parallelization method of and executed for this part of the table T, i.e., the T part classical algorithms, based on the adjacency lists of the must be connected to Q (an example of this option is information graph of the algorithm [14]. Unlike classical shown in Fig. 1b); algorithms, an information query graph is created much iii. If the vertex Q was not the output vertex of the graph, easier and has much fewer vertices. then: create a vertex Q’, which will correspond to the Our research has shown that all methods for optimizing query on data aggregation from queries Q over parts an information graph for various parameters (time, compu- of the table T . tational nodes, volume of interprocess transfers, etc.), devel- oped for classical algorithms, are completely suitable for Option 2. queries. The article [32] shows how it is possible to modify the database query information graph in order to increase i. Connect each vertex corresponding to a part of the the degree of query parallelism. table T with a vertex corresponding to the query Q and executed for this part of the table T, i.e., the T part must be connected to Q (an example of this option is 4 Creating a Document Database Schema shown in Fig. 1c); Taking into Account the Information ii. If the vertex Q was not the output vertex of the graph, Graph of the Database Query then: create a vertex Q’, which will correspond to the query on data aggregation from queries Q over parts In our study, we used an approach based on information of the table T ; graphs to find the dependences of subqueries on the location iii. If there are vertices (except T) that were previously and data availability in accordance with a given distributed connected to the vertex Q, then create a node Q’’ and database schema and information about shards and replicas. connect all the vertex and the vertex Q’ with the ver- A method for constructing an information query graph for tex Q’’; a distributed relational database taking into account infor- mation about table fragmentation. Option 3. 1. To each RDB table put in correspondence the input ver- i. Connect all the vertices that were previously connected tex of the information graph. to the vertex Q with the vertices Q , i = 1…k; 2. Associate each elementary subquery with the vertex of ii. Connect each vertex corresponding to a part of the table the information graph. T with a vertex corresponding to the query Q and exe- 3. Connect the vertices of the information graph with cuted for this part of the table T, i.e., parts T must be directed edges in accordance with the rule: if subquery connected to Q (an example of this option is shown in A receives data from table (or subquery) B, then connect Fig. 1d); vertices A and B with a directed edge from B to A. 1 3 Translating a Distributed Relational Database to a Document Database 141 Fig. 1 Information graphs of a query to the database: a initial graph, b graph corresponding to 12 option 1, c graph corresponding 1 to option 2, d graph correspond- Q Q’ Q Q 11 2 1 2 ing to option 3 (a) (b) Т Q 12 12 12 22 Q’ Q Q’’ Т Q’ 11 Q 11 Q Q 11 21 2 2 Т Т (d) (c) iii. If the vertex Q was not the output vertex of the graph is used, taking into account information about fragmentation b and was connected by an edge to the query Q , then of tables with modification. The modification consists in not create k vertices of the graph, each of which will cor- doing sub-steps related to data aggregation in step b. This is respond to the query Q , executed for a certain part of step iii for options 1 and 2, step v for option 3. the query Q on a certain shard: Q , i = 1…k. Note. The choice of option 1–3 depends on the types of i 1i iv. Connect each vertex corresponding to the part of the operators that are in the database query. So, for example, for query Q with the vertex corresponding to the part of the a database query: query Q , i.e., connect Q with Q ; 1 i 1i v. If the vertex Q was not the output vertex of the graph, Select f1 from (select f1, f2 from t1, t2 where t1.key = t2. then: create a vertex Q’’ that will correspond to the key and condition) as Q, t3 where Q.f1 = t3.key and con- query on data aggregation from queries Q ; dition; 1i c. If there are more tables in the graph that are also distrib- uted across shards, then step b) is repeated for each such option 3 will be optimal for execution. table. To query the database: An example of creating an information query graph for Select f1 from t3 where f1 > (select count(*) from t1, t2 the case of a distributed database. Let there be a database where t1.key = t2.key and condition) and condition; query information graph (Fig. 1a). Let the table T1 be dis- tributed over 2 shards. Then the information graph of the option 2 will be optimal for execution. database query, taking into account the sharding scheme, To query the database: can have one of the forms shown in Fig. 1b–d according to options 1–3 described in the method. Select f1 from t3 where f1 in (select count(*) from t1, t2 Note. For the case of presence of replicas of tables on where t1.key = t2.key and condition) and condition; different shards, the same method of constructing an infor - mation graph of a query for a distributed relational database option 3 will be optimal for execution. 1 3 142 M. Ha, Y. Shichkina A method for constructing a document database schema Fig. 2. Table T1 is the main table in the link between tables taking into account the structure of a distributed relational T1 and T2 (Fig. 2). Table T2 is the main table in the link database: between tables T2 and T3. Consider a query that finds the number of values in field 1. Design an information graph without sharding and rep- T1.A that correspond to values in field F that are a multiple lication. of 5 (listing 1): 2. Transform the information graph to a parallel form that Listing 1 minimizes the transfer of data from shard to shard using Select SQL_NO_CACHE count(*) From t1 where A the method of optimizing parallel algorithms by the in (Select A from t2 where D in (Select D from t3 where number of communications [30–32]. F%5=0)); 3. Perform step 1–2 for all queries, whose structure should be taken into account when optimizing the document database schema in order to speed up the execution of SELECT SQL_NO_CACHEcount(*) these queries. 4. Based on the metadata about tables, fields, queries and FROM t1 Q1 relationships using methods of optimization of the docu- WHERE A IN ment database schema with nested or without embedded documents, form the structure of document collections. 5. Improve the structure of documents, taking into account SELECT A the graphs of information dependencies of queries, mini- FROM t2 Q2 mizing the number of data joining operations. WHERE D IN 5 Testing the Approach to Translating SELECT D Distributed Databases into Document FROM t3 Q3 Databases WHERE F%5 = 0 To test the approach to translating distributed databases into document databases, a test relational database of three tables (DB_TEST) was created, the structure of which is shown in Fig. 3 Subqueries in the original query Fig. 2 The structure of the test database 1 3 Translating a Distributed Relational Database to a Document Database 143 T3 Q1 Q2 Q3 T1 T2 T2 T1 T3 (Partition 2) T3 (Partition 1) Fig. 4 Option A: information graph of a query to a centralized rela- Shard1 tional database Shard2 Fig. 5 Database schema DB_TEST, in which table T3 is distributed This query works with three tables. If the query structure over two shards shown in Fig. 3. highlight subqueries Q1, Q2, Q3,it is pos- sible to design the information graph of the query for the case of a centralized database (Fig. 4). Select D as Q1_1 from t3 where F%5 = 0 // This is a sub- The following describes testing a query to a MongoDB query Q1 to the first part of table T 3. The first part of the database without embedded documents based on a schema table T3 is indicated on the graph: T3 (Partition 1); with sharding and replicas and in accordance with option Select D as Q1_2 from t3 where F%5 = 0 // This is a sub- 1 of the method for constructing an information query query Q1 to the second part of table T3. The second part graph. of the T3 table on the graph is designated as T3 (Parti- The first step was to define the structure of the collection tion 2); in MongoDB by the method of optimizing the document database schema without embedded documents. In order to execute query Q2 on the results of query Q1, Three tables were used as input data: T1(A, B, C), T2 (D, which consists of two parts, it is necessary to combine these A, E), T3(D, F, G). parts of the result into a single part. For this, another sub- The output data—the document database collections: query Q4 is created, which is executed in the application and corresponds to the code: V1 = {T1.B, T1.C, T2.E, T3.G} V2 = {T1.A, T2.D, T2.A, T3.D, T3.F} Select * from Q1_1 union Select * from Q1_2 Select A from t2 where D in Q4 // This is a subquery Q2 Further, there is an option for this database in a dis- to the results of query Q1 and table T2; tributed form with the presence of sharding, but without Select SQL_NO_CACHE count(*) From t1 where A in replications. Figure 5 shows the schema of the DB_TEST Q2 // This is a subquery Q3 to the results of query Q2 database, in which the T3 table is distributed over two and table T1. shards. In this case, a horizontal fragmentation scheme is used, when records from table T3 contain values for all Note: that the information graph can be different. Optimi- fields of table T 3 and are distributed among different shards zation methods for the classical algorithm represented by an according to some rule, for example, according to a hash information graph are described in [32]. key. The case of vertical fragmentation can be thought of If we add a replica for one of the tables to option C of as dividing one table into several projections in the sense a distributed relational database, for example, for table T2 of relational algebra, linked by a one-to-one relationship. (Fig. 7), then a possible information graph of the query from Therefore, this case is a special case of a centralized data- Listing 1, executed on DB_TEST, can be as in Fig. 8. base and is not considered by us when looking for a formal- In this variant, the original query is divided into subquer- ized approach to translating distributed databases from one ies, which are executed in parallel to shards, including a format to another. shard with a replica: Figure 6 shows a possible information graph for a data- base query from Listing 1, executed on a database with the Select D from t3 where F%5 = 0 // This is a query Q1 for schema in Fig. 5. This graph corresponds to the query to the the first part of table T3; distributed database without replication, but with sharding. Select D from t3 where F%5 = 0 // This is a query Q1 to The original query (Listing 1) has been split into subqueries the second part of the table T3; that can be executed in parallel on shards: 1 3 144 M. Ha, Y. Shichkina Fig. 6 Option B: information graph of a query to a distributed Q1 relational database with several T3 (Partition 1) shards, but no replications Q4 Q2 Q3 union Q1 T3 (Partition 2) T2 T1 Fig. 7 Database schema DB_TEST, in which table T2 has a replica Select A from t2 where D in Q1 // This is a query Q2 to To test these three options A, B, C, MariaDB was used the results of query Q1 on the first part of table T 3 and with the Spider storage engine, which allows you to create to table T2; distributed databases from a standard MariaDB installation Select A from t2_copy where D in Q1 // This is a query Q2 and ensures data consistency when sharded. The table T3 to the results of query Q1 on the second part of table T3 was sharded in accordance with the hash algorithm on the and to table T2; field F. Select SQL_NO_CACHE count(*) From t1 where A in Q4 // This is a Q3 query to the results of Q4 query, which combines the results of previous queries, and to table T1. 1 3 Translating a Distributed Relational Database to a Document Database 145 Fig. 8 Option C: information graph of a query to a distributed T2' relational database with several shards and replicas Q1 Q2 T3 (Partition 1) Q4 Q3 union Q1 Q2 T3 (Partition 2) T1 T2'’ Table 1 Volumes of the test database DB_TEST calculated from the moment the query was sent until the moment the results were returned. T1 T2 T3 - For variants B–C of the DB schema, the program exe- Volume 1 1000 3000 9000 cutes sub-queries to shards in parallel. Then the program Volume 2 10 000 30 000 90 000 combines the results obtained from the shards according Volume 3 100 000 300 000 900 000 to the principle of the "Union" mechanism and sends the Volume 4 1000 000 3 000 000 9 000 000 result to the input of the main query. The processing of a database query for variants B–C of the database schema can be represented as follows (Fig. 9): To test the database DB_TEST, data were generated Table  3 shows the query execution time (in seconds) in four different volumes, the sizes of which are shown in Table 1. using the developed program: In Fig. 10 shows an example of the program execution Table 2 shows the volumes of the distributed database DB_TEST in accordance with the diagram in Fig. 5. result for the case of a distributed database structure with a volume of 9,000,000 records. The table T2_copy is a Node2 replica of the table T2 from Node1. Query execution time graphs for different variants of database schemas and data volumes are shown in Fig. 11. Since MariaDB with the Spider storage engine does not support a function for making database queries to individ- Figure 11 shows that for small amounts of data, a query to all three variants of the database scheme works almost ual shards, a separate program was created for testing. This program is developed in the Python programming language the same, but with an increase in data volumes, the time difference between a sequential query and a parallel query with the multiprocessing library. The way in which the program performs a database query increases. In this case, the fastest query is performed to the database, which contains not only sharding, but also on the MariaDB database depends on the variant of the data- base structure A-C. replication. Testing the approach to translating a database with - For variant A of the DB_TEST database schema (not embedded documents, taking into account the structure of the information graph, is shown below. distributed), the execution of the program consisted in the direct execution of the query, and the execution time is Table 2 Volumes of the Node 1 Node 2 distributed database DB_TEST T1 T2 T3 T1 T2_copy T3 Volume 1 1000 3000 4500 Heт 3000 4500 Volume 2 10,000 30,000 45,000 Heт 30,000 45,000 Volume 3 100,000 300,000 450,000 Heт 300,000 450,000 Volume 4 1000,000 3000,000 4500,000 Heт 3000,000 4500,000 1 3 146 M. Ha, Y. Shichkina Execution subquery on Node 1 -Connecting to Server 1onNode1 “UNION” of Execution of Start Finish -Connecting to Server results general query 2onNode2 Execution subquery on Node 2 Fig. 9 Processing database query for variants B–C database schema Table 3 Query execution time In this testing, Option 1 or Option 2 of the information query graph method using an example of a database with Volume 1 Volume 2 Volume 3 Volume 4 embedded documents was evaluated to be faster. Option 1 0.00498 0.0651 0.708 6.37 A distributed database according to the scheme of the Option 2 0.001966 0.01898 0.207 3.655 second variant of the method for constructing an information Option 3 0.00097 0.00898 0.1027 1.351 graph of a database query is shown in Fig. 12: Note. In this case, duplication of fields was not done because the union of the results is done after the first sub- The first step was to determine the structure of the col- query is executed. Translation of subqueries Q1, Q2, Q3 for option B lection in MongoDB using the method of optimizing the schema of a document database with embedded documents. (Fig. 6) from SQL format to MongoDB format was carried out taking into account the execution plan of the relational Three tables were used as input data: T1(A, B, C), T2 (D, A, E), T3(D, F, G). query. An example of a Q1 subquery obtained as a result of translation from SQL to MongoDB format for one shard is The output data—the document database collections: shown below: V1 = {T1. B, T1.C, T2_of_T1 {T2.E, T3_of _T2 ∶ {T3.G}}}. V2 = {T1.A, T2_of_T1 ∶ {T2.A, T2.D, T3_of _T2 ∶ {T3.D, T3.F}}} Fig. 10 An example of program execution 1 3 Translating a Distributed Relational Database to a Document Database 147 Fig. 11 Query execution time for different variants of the data- base schema and data volume Fig. 12 DB schema with embedded documents 1 3 148 M. Ha, Y. Shichkina Note. Because the second subquery works with the results Taking into account the specification of the operation of the first query and with fields from external documents in of the MongoDB DBMS, the Q1, Q2, Q3 subqueries are relation to embedded documents, then it is necessary in this executed in the MongoDB query language, and the query case to duplicate all fields on both shards. to combine the results of the Q1 subqueries to the first part As in variant B, in this variant the subqueries Q1, Q2, Q3 of the T3 table and Q1 to the second part of the T3 table is were translated into the MongoDB query language. Query executed in a Python program. This is because MongoDB Q4 to combine the results of subqueries Q1, Q2 was exe- does not support join operations. The query execution plan cuted by a Python program. is shown in Fig. 13: The Python code for combining query results is: The distributed database according to the scheme of vari- The subquery codes that are executed on each shard are ant C of the method for constructing the information query shown below: graph is shown in Fig. 14: T3_1 Result T1 T2 Union User T3_2 MongoDB MongoDB + Python Fig. 13 Query execution plan using the program 1 3 Translating a Distributed Relational Database to a Document Database 149 Fig. 14 DB schema with embedded documents T2 T3_1 Result T1 Union T2 T3_2 MongoDB+ Python MongoDB Fig. 15 Query execution plan using the program 1 3 150 M. Ha, Y. Shichkina Table 4 Volumes of the test database DB_TEST: The query execution plan is shown in Fig. 15: To test queries for the DB_TEST database, data were T1 T2 T3 generated in four different volumes, the sizes of which are Volume 1 1000 3000 9000 shown in Table 4. Volume 2 10,000 30,000 90,000 Table 5 shows the volumes of the distributed database Volume 3 100,000 300,000 900,000 DB_TEST in accordance with the scheme in Fig. 5. Volume 4 1000,000 3000,000 9000,000 Option 1 is a variant of a centralized database, option 2 is a distributed database with an information query graph according to option 2 of the method for constructing an Table 5 Volumes of database data DB_TEST on relational database information query graph, option 3 is a database with an shards information graph of queries according to option 3 of a method for constructing an information query graph. Option 2 Option 3 Table 6 shows the volumes of the distributed database Node 1 Node 1 Node 2 Node 2 DB_TEST in accordance with the scheme in Fig. 5. Volume 1 1000 1000 849 849 Table  7 shows the query execution time (in seconds) Volume 2 10,000 10,000 8682 8682 using the developed program: Volume 3 10,000 10,000 87,461 87,461 Query execution time graphs for different variants of the Volume 4 1,000,000 1,000,000 874,525 874,525 database schema and data volume are shown in Fig. 16 From the diagram in Fig. 16, it can be seen that the maxi- mum parallelization of a query in the database before com- Table 6 Volumes of database DB_TEST on MongoDB database bining the result works fastest with the application of an shards external program. In our early article [32] we showed that queries to a cen- Volume Option A Option B Option C tralized database with an optimized structure using a set Node 1 Node 2 Node 1 Node 2 theory method are significantly faster than a database built Volume 1 344,7 кб 267 кб 183 кб 213,2 кб 203,9 кб manually without optimizing the structure. The results of Volume 2 3,4 мб 2,61 мб 1,8 мб 2,1 мб 2,0 мб this study show that it is possible to further speed up query Volume 3 33,7 мб 26,12 мб 18,06 мб 20,8мб 20,1мб execution if the system is distributed and the architecture of Volume 4 336,6 мб 261,3 мб 180,69 мб 208,37мб 200,71 мб the distributed system is taken into account when construct- ing queries. So in Figs. 11 and 16 it is possible to see that the centralized database even with an optimized structure Table 7 Time of execution of queries to DB_TEST (Option 1) has the longest execution time. Options 2 and 3 allow you to execute some of the subqueries in parallel and Volume 1 Volume 2 Volume 3 Volume 4 therefore work faster. Option A 0.0464 0.1423 1.297 13.445 It should also be noted that the accelerated version Option B 0.0189 0.112 1.089 11.531 requires more memory to store duplicate data. It is possi- Option C 0.0114 0.0818 0.7502 8.884 ble to see the results of testing all three options for the use of RAM and permanent memory, which are presented in Table 8. 1 3 Translating a Distributed Relational Database to a Document Database 151 Fig. 16 Query execution time for different variants of the data- base schema and data volume The table shows the amount of used memory for two sizes of data, so its execution speed is also higher. In Option 3, of the database. For the other two database volumes, the most of the data is processed at the first stage (execution of picture of used memory is similar. More clearly the use of queries Q1, Q2) in parallel. RAM is shown in Fig. 17. Figure 17 shows that the fastest Option 3 is more memory intensive than Option 2. This is due to the fact that more 6 Conclusion duplicate data are placed in memory to speed up the execu- tion of the query. Nevertheless, Option 3 is much more eco- This article has shown that when creating a distributed docu- nomical compared to Option 1. ment database, there is a need to consider: To assess which operations in the query take the long- est time with different sharding and replication options, the 1. The structure of queries, relationships between tables. test was conducted for a database with a volume of 1000 In accordance with these data, it is necessary to create 000 records. Based on the test results a diagram was built a document database model with or without embedded (Fig. 18). documents. The length of each block on the diagram corresponds to 2. The presence of shards and replicas. The presence of the execution time of separate query operation (Q1–Q4). replicas significantly speeds up the process of execut- The execution time of each operation was measured using ing queries, because allows to maximally parallelize a a specially designed program. Twenty-five tests were car - complex query based on the information graph of the ried out. The diagram shows the average results. From the query. diagram in Fig. 18 it is possible to see that in Option 3 the 3. Internal dependencies and query operations. This deter- execution time of parallel subqueries (Q1–Q2) on the shards mines the way the information graph is presented and, is longer than in Option 2. But due to the fact that most of subsequently, the structure of the document database the work has been parallelized the "join" operation (Q4) in collections. Option 3 is performed for less data and therefore executed 4. Consensus between memory size and query execution faster than in Option 2. Paralleling most of the work is not speed. Database queries in which maximum paralleli- achieved by performing more suboperations, but by process- zation is done directly by the DBMS are usually faster. ing more data in parallel, as it is possible to see from the However, in this case, it is necessary to make not a mini- RAM occupancy diagram in Fig. 17. It is in Option 3 that mum amount of memory and store duplicate records in more memory is occupied, because more data are processed. different collections on different shards. The last subquery Q3 is also applied to a smaller amount 1 3 152 M. Ha, Y. Shichkina Table 8 Results of testing memory Volume Option Permanent RAM Maximum memory RAM (Kb) (Mb) 1,000,000 Option A—centralized database 976 24.9 Option B—distributed database 1972 96.5 with sharding without replica- tion Option C—distributed database 2103 204.2 with sharding and replication 1 3 Translating a Distributed Relational Database to a Document Database 153 Fig. 17 Graph of used memory changes during query execution Option C- distributed Aggregation data from shard1 database with sharding and Aggregation data from shard2 replication Union result Execute of general query 1 2 3 45 6 78 910111213 Time(sec.) Option B- distributed database with sharding Aggregation data from shard1 without replication ag gregation data from shard2 Union result Execute of general query Time(sec.) 1 2 3 45 6 78 910111213 Option A- centralized database Time of execute query 3 6 Time(sec.) 1 2 45 78 910111213 Fig. 18 Diagram of execution of operations of a query on a database of 1000 000 records Acknowledgements Not applicable. Funding This work was supported by the Ministry of Science and Higher Education of the Russian Federation by the Agreement No 075-15-2020-933 dated 13.11.2020 on the provision of a grant in the Author Contributions Conceptualization, methodology, formal analy- form of subsidies from the federal budget for the implementation of sis, writing—review and editing, project administration were per- state support for the establishment and development of the world-class formed by YS; MH helped in software, validation, formal analysis, scientific center «Pavlov center «Integrative physiology for medicine, original draft preparation, visualization. All authors have read and high-tech healthcare, and stress-resilience technologies» agreed to the published version of the manuscript. 1 3 154 M. Ha, Y. Shichkina Data Availability Data available upon request. 11. Jia T, Zhao X, Wang Z, Gong D, Ding G (2016) Model transfor- mation and data migration from relational database to MongoDB. In: IEEE BigData, pp. 60–67 Declarations 12. Varga V, Jánosi-Rancz KT, Kálmán B (2016) Conceptual design of document NoSQL database with formal concept analysis. Acta Conflict of interest The authors declare that they have no known com- Polytech Hung 13(2):229–248 peting financial interests or personal relationships that could have ap- 13. Shaikh NF (2017) Data migration from SQL To Mongodb. Int J peared to influence the work reported in this paper. Eng Res Dev 13(11):32–36 14. Kim HJ, Ko EJ, Jeon YH, Lee KH (2018) Migration from RDBMS Open Access This article is licensed under a Creative Commons Attri- to column-oriented NoSQL: lessons learned and open problems. bution 4.0 International License, which permits use, sharing, adapta- In: Lee W, Choi W, Jung S, Song M (Eds.), Proceedings of the 7th tion, distribution and reproduction in any medium or format, as long International Conference on Emerging Databases. Lecture Notes as you give appropriate credit to the original author(s) and the source, in Electrical Engineering, vol 461. Springer, Singapore. https:// provide a link to the Creative Commons licence, and indicate if changes doi.or g/1 0.100 7/9 78-981 -1 0-652 0-0_3, https://l ink.s pring er.c om/ were made. The images or other third party material in this article are chapt er/ 10. 1007/ 978- 981- 10- 6520-0_ 3# citeas included in the article's Creative Commons licence, unless indicated 15. Lawrence R (2014) Integration and virtualization of relational otherwise in a credit line to the material. If material is not included in SQL and NoSQL systems including MySQL and MongoDB. the article's Creative Commons licence and your intended use is not In: 2014 international conference on computational science and permitted by statutory regulation or exceeds the permitted use, you will computational intelligence, Las Vegas, NV, USA, 2014, pp. 285– need to obtain permission directly from the copyright holder. To view a 290 https:// doi. org/ 10. 1109/ CSCI. 2014. 56 copy of this licence, visit http://cr eativ ecommons. or g/licen ses/ b y/4.0/ . 16. Hsu MJC, Hsu CH, Chen SC, Chung YC (2014) Correlation aware technique for SQL to NoSQL transformation. In: 7th international conference on ubi-media computing and workshops (UMEDIA). IEEE Press, Ulaanbaatar, pp. 43–46 References 17. Zhang H, Wang Y, Han J (2011) Middleware design for integrat- ing relational database and NOSQL based on data dictionary. In: 1. Ha M, Shichkina Y (2020) Creating collections without embed- Proceedings 2011 international conference on transportation, ded documents for document databases taking into account the mechanical, and electrical engineering (TMEE), Changchun, queries. Computation 8(2):45. https://doi. or g/10. 3390/ com put atio China, pp. 1469–1472 https://doi. or g/10. 1109/ TMEE. 2011. 61994 n8020 045 85 2. Shichkina Y (2015) Approaches to speed up data processing in 18. Ma K, Abraham A (2013) Toward lightweight transparent data relational databases. Procedia Comput Sci 150:131–139 middleware in support of document stores. In: 2013 third world 3. Shichkina Y, Kupriyanov M, Al-Mardi M (2016) Optimization congress on information and communication technologies (WICT algorithm for an information graph for an amount of communica- 2013), Hanoi, Vietnam, 2013, pp. 253–257 https:// doi. org/ 10. tions, Lecture Notes in Computer Science (including subseries 1109/ WICT. 2013. 71131 44 Lecture Notes in Artificial Intelligence and Lecture Notes in Bio- 19. Adriana J, Holanda M (2018) NoSQL2: SQL to NoSQL data- informatics), 9870 LNCS, c. 50-62, ISSN: 03029743 ISBN: 978- bases. In: Rocha Á, Adeli H, Reis L, Costanzo S (eds) Trends and 331946300-1, https:// doi. org/ 10. 1007/ 978-3- 319- 46301-8_5 advances in information systems and technologies. WorldCIST’18 4. Karnitis G, Arnicans G (2015) Migration of relational database to 2018. Advances in intelligent systems and computing, vol 746. document-oriented database: structure denormalization and data Springer, Cham transformation. In: 2015 7th international conference on compu- 20. QueryMongo. Query mongo: Mysql to mongodb query translator, tational intelligence, communication systems and networks, Riga, 2019. URL http:// www. query mongo. com. Latvia, pp. 113–118 https:// doi. org/ 10. 1109/ CICSyN. 2015. 30 21. Russell V (2016) sql-to-mongo-db-query-converter URL https:// 5. Toufik F, Bahaj M (2019) Model transformation from object rela-github. com/ vince ntrus sell/ sql- to- mongo- db- query- conve rter. tional database to nosql document database. In: Proceedings of the 22. Reddy R (2015) Mongodb translator - teiid 9.0 (draft), URL 2nd international conference on networking, information systems https:// docs. jboss. org/ aut hor/ displ ay/ TEIID/ Mongo DB+ Tr ans & security (niss19). Association for computing machinery, New lator York, NY, USA, Article 49, pp. 1–5 https://doi. or g/10. 1145/ 33203 23. Bernstein PA, Goodman N, Wong E, Reeve CL, Rothnie JB (1981) 26. 33203 81 Query processing in a system for distributed databases (SDD-1). 6. Al Mahruqi RS, Alalfi MH, Dean TR (2019) A semi-automated ACM Trans Database Syst 6(4):602–625. https://d oi.o rg/1 0.1 145/ framework for migrating web applications from SQL to document 319628. 319650 oriented NoSQL database. In: Proceedings of the 29th annual 24. Moritz D, Halperin D, Howe B, Heer J (2015) Perfopticon: visual international conference on computer science and software engi- query analysis for distributed databases. Comput Graph Forum. neering (CASCON '19). IBM Corp., USA, pp. 44–53https:// doi. org/ 10. 1111/ cgf. 12619 7. Stanescu L, Brezovan M, Burdescu DD (2016) Automatic map- 25. Dai J (2019) SQL to NoSQL: what to do and How. IOP Conf ping of MySQL databases to NoSQL MongoDB. In: 2016 FedC- Ser Earth Environ Sci 234:012080. https://doi. or g/10. 1088/ 1755- SIS, pp. 837–8401315/ 234/1/ 012080 8. Zhao G, Lin Q, Li L, Li Z (2014) Schema conversion model of 26. Wu S, Li F, Mehrotra S, et al. (2011) Query optimization for mas- SQL database to NoSQL. In: 2014 Ninth 3PGCIC, pp. 355–362 sively parallel data processing. In: 2nd ACM symposium on cloud 9. Freitas MC, Souza DY, Salgado AC (2016) Conceptual mappings computing. ACM, p. 12 to convert relational into NoSQL databases. In: Proceedings of the 27. Xu Y, Hu S (2013) QMapper: a tool for SQL optimization on hive 18th ICEIS using query rewriting. In: 28th international conference on world 10. Karnitis G, Arnicans G (2015) Migration of relational database to wide web companion. pp. 211–212 document-oriented database: structure denormalization and data 28. Lee C, Zheng Y (2015) Automatic SQL-to-NoSQL schema trans- transformation. In: 2015 7th ICCICSN, pp. 113–118 formation over the MySQL and HBase databases. In: IEEE inter- national conference on consumer electronics-Taiwan, pp. 426–427 1 3 Translating a Distributed Relational Database to a Document Database 155 29. https:// neo4j. com/ devel oper/ guide- impor ting- data- and- etl/ 32. Ha VM, Shichkina YA, Kostichev SV (2019) Determining the 30. Park Y, Shankar M, Park B-H, Ghosh J (2014) Graph databases composition of collections for key-document databases based on for large-scale healthcare systems: A framework for efficient data a given set of object properties and database queries. Computer management and data services. In: IEEE 30th international con- Tools in Education, no. 3, pp. 15–28. https:// doi. org/ 10. 32603/ ference on data engineering workshops. Chicago, IL, USA, pp. 2071- 2340- 2019-3- 15- 28 (in Russian) 12–19 31. De Virgilio R, Maccioni A, Torlone R (2013) Converting rela- tional to graph databases. In: First international workshop on graph data management experiences and Systems. ACM 1 3 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Data Science and Engineering Springer Journals

Translating a Distributed Relational Database to a Document Database

Data Science and Engineering , Volume 7 (2) – Jun 1, 2022

Loading next page...
 
/lp/springer-journals/translating-a-distributed-relational-database-to-a-document-database-Kx0BYSwx5Y

References (29)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2022
ISSN
2364-1185
eISSN
2364-1541
DOI
10.1007/s41019-022-00181-9
Publisher site
See Article on Publisher Site

Abstract

The problem of transferring data from a database of one type to a database of another type is relevant today due to the increase in data volumes and the complexity of tasks solved in various fields of human activity. The existing databases are less and less satisfying the needs of users. New types of databases appear that are more suitable for solving large-scale problems. This article proposes an algorithm for solving the problem of optimizing the document structure of a database based on metadata about the structure of a distributed relational database from which data are transferred. The approach also takes into account information about the structure of the priority database queries. The priority of database queries is user-defined. The system of automatic translation of the database, taking into account these metadata, allows the user to create a distributed document database that is optimal in two parameters: in terms of the volume of stored data and in terms of the execution time of priority database queries. Keywords NoSQL · Database query · Collection · Document · Database structure optimization · Embedded documents 1 Introduction the complexity will increase by an order of magnitude, and the characteristics that make relational databases so attrac- Over the past decades, the variety of database applications tive rapidly reduce to zero the chances of using them as a has increased, and with them the importance of database platform for large distributed systems. features such as simplicity, robustness, flexibility, perfor - The solution to this problem is to transfer systems to other mance, scalability, and compatibility has increased. And types of databases that have a higher scalability, but with the with the increase in the number of databases, one feature loss of other capabilities available in relational databases. becomes more important than all the others. It is scalability. It must be said that this solution has a number of difficul - As more and more applications run under high load condi- ties: the choice of a database type suitable for an applied tions, their scalability requirements can change and grow task, in the case of a document database or a database of the very quickly. Relational databases scale well only if they are column family type; choice of a global database structure in located on a single server. When the resources of this server an existing distributed system and a local structure of objects run out and it becomes necessary to increase the number of in a database, transferring data from an existing database to computational nodes and distribute the load between them, a new type of database. then problems arise with the growth of the complexity of The problem is that document database management sys- relational databases, which directly depends on the amount tems do not perform " join" operations well, so it is impossi- of scalability. So, if there is an increase in the number of ble to translate a distributed relational database into a docu- servers not to a few, but to a hundred or a thousand, then ment database format on a one-to-one principle. The use of this principle will lead to the opposite effect: a decrease of performance due to complication of queries and their partial * Yulia Shichkina implementation in client applications (those parts of queries strange.y@mail.ru that cannot be implemented by a document DBMS, e.g., the Muon Ha "join" operation). Currently, there is no formalized way to muon.ha@mail.ru build an optimal distributed document database structure based on relational database metadata for translating data Faculty of Computer Science and Technology, Saint from one format to another. In this paper we propose a solu- Petersburg Electrotechnical University “LETI”, St. Petersburg, Russia tion to this problem. Vol:.(1234567890) 1 3 Translating a Distributed Relational Database to a Document Database 137 In earlier articles, we considered the problems associated specific subject area. The third section provides a methodology with the optimization of the structure of non-distributed doc- for applying methods of optimizing the structure of document ument databases and databases such as a family of columns, databases, taking into account the structure of planned queries taking into account the structure of the planned queries [1]. for the case of a distributed database. The fourth section pre- We also described the methodology for applying methods sents the results of testing the proposed methodology on data- for optimizing the structure of a database, depending on the bases of various sizes with queries of various structures and stage at which this optimization is carried out: at the initial degrees of their embeddedness. In the conclusion, the main stage of creating a new database, at the stage of translating results of the research are summarized. data from relational databases to NOSQL databases, or when consolidating databases of various types. But, all these solu- tions concerned only centralized databases.2 Overview of Related Works Currently, due to the large increase in data volumes, dis- tributed databases are increasingly used. Therefore, in this The problem of translating relational databases into NoSQL article, we show how methods designed for centralized data- is relevant and solved by many researchers. The main dif- bases would work when used to translate distributed rela- ficulty in solving this problem is the difference in data struc- tional databases to distributed NOSQL databases. According tures in relational databases (RDBs) and NoSQL. to our methodology described in [xxx], these methods can The paper [4] offers a solution for quick data migration also be used when creating new document databases. from a relational database into a document-oriented data- The proposed methods are based on graph theory, set base. Authors have created semi-automatically two logical theory, and parallel computing theory: levels over physical data. Users can refine generated logical data model and configure data migration template for each – Elements of set theory allow to formalize the process of needed document. Data migration features are implemented choosing the optimal structure of documents in collec- into relational database browser Dig Browser. But, the docu- tions of the MongoDB database, taking into account the ment does not present solutions to the problems of optimiz- architecture of the computing system, the set of elements ing the database schema and database queries. stored in the database, and queries to the database that The paper [5] describes some approaches of migration are most often executed or require the highest execution and proposes an approach of model transformation from speed. This is the first necessary step to achieve an opti- object relational database to NoSQL document database mal balance between the amount of stored data and the (MongoDB). speed of execution of queries to the database. In article [6], the authors describe a semi-automated – Elements of graph theory allow visualizing and formal- approach to migrate highly dynamic SQL-based web appli- izing the query execution plan depending on the architec- cations to ones that use document-oriented NoSQL data- ture of the computing system and the database schema. bases such as MongoDB using source analysis and transfor- – Elements of the theory of parallel computing allow to mation techniques. Authors demonstrate our semi-automated formalize the process of optimizing the query execution framework on the analysis and migration of three existing plan. We use the theory of parallel computing based on web applications to extract, classify, translate, migrate, and information graphs [2, 3]. optimize the queries and migrate the PHP code to interact with the migrated database. Authors approach provides Thus, our proposed approach to translating a distributed migrating and optimizing the embedded SQL queries to relational database into a distributed document database is interact with the new database system and changing the based on sequential steps: optimizing the document database application code to use the translated queries. But, this schema based on metadata about the relational database, build- approach does not take into account distributed database ing query execution plans, optimizing query execution plans, schemas and query optimization based on this schema. translating queries from SQL format to MongoDB format. Approaches to the translation of databases from the RDB The article is structured as follows. The second section pro- format to the format of documentary NoSQL databases are vides an analysis of existing studies close to the topic of our described in various publications [7–12]. Articles [7, 8] research. This section shows that despite a huge number of describe automatic solutions in which the input data are publications on the topic of relational databases and NOSQL RDB metadata and ER diagrams. In [8], an algorithm is and their mutual transfer from one format to another, there is presented that takes into account the dependencies between no research devoted to formalized methods of data transla- tables and the number of primary keys and foreign keys in tion in distributed databases. There are separate publications the conversion process. In [9], an algorithm is presented that in which the authors talk about their experience of translating uses dependencies between tables, but data embeddedness data in distributed databases to solve problems in a narrow is used only in relation to a foreign key to a primary key. In 1 3 138 M. Ha, Y. Shichkina article [7], tables are classified according to four types: main, interact with any software supporting JDBC. A virtual- subclass, related, and general. The algorithm presented by ization system is built on top of the NoSQL sources that the authors uses this classification to include subclass and translates SQL queries into the source-specific APIs. The general tables in the main table. Table of relationships is virtualization architecture allows users to query and join data converted by using the links. This research mainly focuses from both NoSQL and relational SQL systems in a single on transforming a relational concept (e.g., table, relation- SQL query. ships, etc.) into a NoSQL (document database) concept In [16], it is proposed to analyze log files to determine (document, subdocuments, etc.). The research proposed which tables are frequently involved in a query with the by the authors [10] has a different classification, consisting “join” operator. The authors suggest placing these tables on of four classes: codifier, simple entity, complex entity, and the same node, if possible. This makes it possible not to use N:N-relationship. In addition to classification, the user must the "join" operator. provide a “focused table” that represents the target NoSQL In general, the task of translating queries between the entity. The algorithm creates a tree with related tables. In types of RDB and NoSQL databases is not a solved prob- [11], the authors propose the use of tags to control the trans- lem when translating databases. Analysis of publications lation of databases. The user tags the ER diagram with tags shows that the following directions of its solution can be that represent the characteristics of the data and queries. distinguished: Based on these tags, the algorithm decides to use embedded documents or links in the conversion. - Development of middleware software that executes SQL The authors [12] used formal concept analysis, con- commands to process data for NoSQL databases. This ceptual scaling, and relational concept analysis to create a approach was proposed in [17–19]. conceptual model that helps users define the structures of - Translation of queries from SQL to NoSQL format. For a NoSQL database. This model provides translation rules example, [20] is a web translator that takes a SQL query for all types of relationships between relational data tables and generates an equivalent MongoDB query. The trans- (1–1, 1-n, n–n). lation is based solely on the syntax of the SQL query. The In connection with the increase in the volume of data, approach does not take into account any data or schemas. there has recently been an increased attention to the research There is no explanation for the approach to translation. of translation processes and optimization of queries when Russell [21] describes a library that provides an API for translating data from one format to another. translating SQL queries to MongoDB. The translation The document [13] proposes an application with a graphi- is based on SQL query syntax only. The system from cal user interface for data transfer and automatic query con- [22] requires the user to provide a MongoDB schema version. The system that authors propose can be subdivided expressed in relational form using tables, procedures, into an online database application and a query conversion and functions. utility. The client application acts as an environment that allows the users to select and convert the databases from Among the studies directly related to the optimization of SQL to MongoDB. The article also discusses the structure, queries to relational databases, it is possible to single out data types, and keys. The query conversion utility provides [23]. This document describes the methods used to optimize the user with graphical user interface that allows him to relational queries in an SDD-1 distributed database system. choose from some basic predefined SQL queries or write his Queries are submitted to SDD-1 in a high-level procedural own SQL query. This way the user can implement his que- language called Datalanguage. ries even if he is not familiar with the MongoDB database. The following papers describe solutions for query The article [14] is devoted to the problems of transla- performance: tion of queries from a distributed relational database into The article [24] describes the Perfopticon system, which NOSQL. This article discusses column-oriented NoSQL is an interactive query profiling tool. This tool allows you DBMS, HBase, because it is widely used by many Inter- to quickly identify performance bottlenecks and data skew. net enterprises such as Facebook, Twitter, and LinkedIn. Perfopticon combines interactive visualizations of (1) query Because HBase does not support SQL, authors use Apache plans, (2) overall query execution, (3) data flow among serv - Phoenix as an SQL layer on top of HBase. The authors also ers, and (4) execution traces. These views coordinate mul- confirm that important unsolved problems are supporting tiple levels of abstraction to enable detection, isolation, and complex SQL queries, automatic index selection, and opti- understanding of performance issues. mizing SQL queries for NoSQL. The article [25] provides an overview of query optimiza- In article [15], authors present a generic standards-based tion systems. In particular, it was considered that AQUA architecture that allows NoSQL systems, with specific focus [26] is a query optimizer which aims at reducing the cost on MongoDB, to be queried using SQL and seamlessly of storing intermediate results by optimizing the order of 1 3 Translating a Distributed Relational Database to a Document Database 139 join operations. QMapper [27] is an intelligent translator 3 Methods on Which Distributed Database to enable automatic rule-based SQL-to-HiveQL mapping Translations are Based as well as cost-based optimizing of the translated queries. There are also a number of articles devoted to translat- The method for optimizing the schema of document ing databases into other types of NoSQL databases. Among databases taking into account the structure of executed them are the following. queries for the case of a distributed database is based on The approach described in [28] is for creating data struc- similar methods for centralized databases described in tures for column-oriented databases such as HBASE. The [32] and methods for optimizing the graph of information authors propose to analyze the primary key and foreign key dependencies. of relational tables and then create a large table that stores The essence of the methods for optimizing the structure all related information. The main focus of this approach is of document databases, taking into account the structure of to create a key for a large table. The authors [11] also formu- executed queries for centralized databases, consists in the lated some rules for converting a table with one embedded following postulates: and several embedded DBMS tables to a column-oriented HBase database. 1. It is necessary to present a relational database in the The easiest way to translate a relational database into a form of a single set of fields, and all queries are also in graph database is to simply convert each table record to a the form of a single set of fields. Then, by iteratively node, grouping the set of nodes by label name. Foreign keys applying set subtraction and set intersection operations, between tables are converted into graph edges connecting it is possible to obtain collections of fields that will cor - two nodes [29]. The authors of [30] proposed the transfor- respond to the collections in the document database. mation of the 3NF-form of the RDB into the form of the 2. It is necessary to take into account during these opera- 3EG graph model of the database. The authors used a rela- tions the presence of types of links between tables of tional database that exists in 3NF as input data and proposed the relational database. Then, based on the collections four rules for transforming relational database tables into a obtained from the results of the first postulate, it is pos- graph database. sible to optimize the structure of the document database The authors of [31] also proposed an algorithm for con- by creating embedded documents. verting a relational database into a graph database. They combine key values from more than one row in a single These two methods of optimizing the structure of a docu- node, so that when a user needs relevant information, he can ment database without and with embedded documents are get it by visiting only one node. They define some rules for described in detail in [2]. different cases of grouping information from different rows The aim of the current research is to find approaches to of a table in one node. They created a description of the full account for the structure of a distributed database for accel- path along the graph, which allows to go from the source erated query execution. node to the destination node, and find information from a A database query is a set of instructions that describe the given node, which must be duplicated in another node. order of actions of the executor to achieve a certain result. Our analysis shows that the problem of translating data After the development of parallelism in the operation of from one format to another is currently a very actual prob- computers, the word "sequence" began to be replaced by lem. However, a good solution to this problem has not yet the more general word "order." It is possible to say that a been found. Many researchers are looking for a solution. query to a database is a determination of the order and con- There are specific solutions for individual areas, for direct ditions for selecting data, and the process of creating a query translation of tables into collections, for transferring RDBs is a decomposition of a query into elementary subqueries. to document databases, taking into account the types of rela- An elementary subquery can be understood as a construct tionships between tables, for extracting data from document «Select» without embedded subqueries. If a query contains databases using SQL queries. However, the researchers have an elementary query (subquery) inside it, then it is called the not yet proposed a general approach that would take into parent, and the subquery is called the child. account the database schema, the computing system archi- An information graph is further understood as a directed tecture, and the query structure. Our research is devoted to acyclic multigraph, the vertices of which correspond to finding such a general approach. elementary queries within the main query, and the edges This article proposes a formalized approach that takes correspond to data transfer between them. into account RDB schemas, SQL query schemas, and shard- The principle of efficient organization of parallel query ing and replication schemes to create a distributed document processing implies the search for subqueries that are inde- database. pendent of each other according to the data. When creating 1 3 140 M. Ha, Y. Shichkina a new query model, it is important that query properties 4. Adaptation of the graph for horizontal sharding for the such as computational accuracy and query robustness are case when the table "T" of the RDB is distributed into k preserved. For this, it is important to know which subqueries shards: have input data that result from the execution of other sub- a. Create k input vertices of the graph, each of which queries. These links are reflected in the information graph. will correspond to a certain part of the table on a An important role in this is played by the concept of a paral- certain shard: T , i = 1…k. lel graph form. b. If table T was connected by an edge with query Q, When modifying the parallel form of the information then create k graph vertices, each of which will cor- graph, the structure of the graph does not change, but only respond to query Q executed for a certain part of the its projection on the plane. Due to this, a new query created table on a certain shard: Q , i = 1…k. from the modified parallel form of the information graph is equivalent to the original query. But, with this approach, Further, one of three options is possible: a global extremum may not be achieved in optimizing the Option 1. solution to the problem in terms of certain parameters, for example, in time. This approach will find the most optimal i. Connect all the vertices that were previously con- query among all equivalent queries, taking into account data nected to the vertex Q with the vertices Q , i = 1…k; relationships between subqueries. ii. Connect each vertex corresponding to a part of the To construct a parallel plan of queries in the first approxi - table T with a vertex corresponding to the query Q mation, it is possible to use the parallelization method of and executed for this part of the table T, i.e., the T part classical algorithms, based on the adjacency lists of the must be connected to Q (an example of this option is information graph of the algorithm [14]. Unlike classical shown in Fig. 1b); algorithms, an information query graph is created much iii. If the vertex Q was not the output vertex of the graph, easier and has much fewer vertices. then: create a vertex Q’, which will correspond to the Our research has shown that all methods for optimizing query on data aggregation from queries Q over parts an information graph for various parameters (time, compu- of the table T . tational nodes, volume of interprocess transfers, etc.), devel- oped for classical algorithms, are completely suitable for Option 2. queries. The article [32] shows how it is possible to modify the database query information graph in order to increase i. Connect each vertex corresponding to a part of the the degree of query parallelism. table T with a vertex corresponding to the query Q and executed for this part of the table T, i.e., the T part must be connected to Q (an example of this option is 4 Creating a Document Database Schema shown in Fig. 1c); Taking into Account the Information ii. If the vertex Q was not the output vertex of the graph, Graph of the Database Query then: create a vertex Q’, which will correspond to the query on data aggregation from queries Q over parts In our study, we used an approach based on information of the table T ; graphs to find the dependences of subqueries on the location iii. If there are vertices (except T) that were previously and data availability in accordance with a given distributed connected to the vertex Q, then create a node Q’’ and database schema and information about shards and replicas. connect all the vertex and the vertex Q’ with the ver- A method for constructing an information query graph for tex Q’’; a distributed relational database taking into account infor- mation about table fragmentation. Option 3. 1. To each RDB table put in correspondence the input ver- i. Connect all the vertices that were previously connected tex of the information graph. to the vertex Q with the vertices Q , i = 1…k; 2. Associate each elementary subquery with the vertex of ii. Connect each vertex corresponding to a part of the table the information graph. T with a vertex corresponding to the query Q and exe- 3. Connect the vertices of the information graph with cuted for this part of the table T, i.e., parts T must be directed edges in accordance with the rule: if subquery connected to Q (an example of this option is shown in A receives data from table (or subquery) B, then connect Fig. 1d); vertices A and B with a directed edge from B to A. 1 3 Translating a Distributed Relational Database to a Document Database 141 Fig. 1 Information graphs of a query to the database: a initial graph, b graph corresponding to 12 option 1, c graph corresponding 1 to option 2, d graph correspond- Q Q’ Q Q 11 2 1 2 ing to option 3 (a) (b) Т Q 12 12 12 22 Q’ Q Q’’ Т Q’ 11 Q 11 Q Q 11 21 2 2 Т Т (d) (c) iii. If the vertex Q was not the output vertex of the graph is used, taking into account information about fragmentation b and was connected by an edge to the query Q , then of tables with modification. The modification consists in not create k vertices of the graph, each of which will cor- doing sub-steps related to data aggregation in step b. This is respond to the query Q , executed for a certain part of step iii for options 1 and 2, step v for option 3. the query Q on a certain shard: Q , i = 1…k. Note. The choice of option 1–3 depends on the types of i 1i iv. Connect each vertex corresponding to the part of the operators that are in the database query. So, for example, for query Q with the vertex corresponding to the part of the a database query: query Q , i.e., connect Q with Q ; 1 i 1i v. If the vertex Q was not the output vertex of the graph, Select f1 from (select f1, f2 from t1, t2 where t1.key = t2. then: create a vertex Q’’ that will correspond to the key and condition) as Q, t3 where Q.f1 = t3.key and con- query on data aggregation from queries Q ; dition; 1i c. If there are more tables in the graph that are also distrib- uted across shards, then step b) is repeated for each such option 3 will be optimal for execution. table. To query the database: An example of creating an information query graph for Select f1 from t3 where f1 > (select count(*) from t1, t2 the case of a distributed database. Let there be a database where t1.key = t2.key and condition) and condition; query information graph (Fig. 1a). Let the table T1 be dis- tributed over 2 shards. Then the information graph of the option 2 will be optimal for execution. database query, taking into account the sharding scheme, To query the database: can have one of the forms shown in Fig. 1b–d according to options 1–3 described in the method. Select f1 from t3 where f1 in (select count(*) from t1, t2 Note. For the case of presence of replicas of tables on where t1.key = t2.key and condition) and condition; different shards, the same method of constructing an infor - mation graph of a query for a distributed relational database option 3 will be optimal for execution. 1 3 142 M. Ha, Y. Shichkina A method for constructing a document database schema Fig. 2. Table T1 is the main table in the link between tables taking into account the structure of a distributed relational T1 and T2 (Fig. 2). Table T2 is the main table in the link database: between tables T2 and T3. Consider a query that finds the number of values in field 1. Design an information graph without sharding and rep- T1.A that correspond to values in field F that are a multiple lication. of 5 (listing 1): 2. Transform the information graph to a parallel form that Listing 1 minimizes the transfer of data from shard to shard using Select SQL_NO_CACHE count(*) From t1 where A the method of optimizing parallel algorithms by the in (Select A from t2 where D in (Select D from t3 where number of communications [30–32]. F%5=0)); 3. Perform step 1–2 for all queries, whose structure should be taken into account when optimizing the document database schema in order to speed up the execution of SELECT SQL_NO_CACHEcount(*) these queries. 4. Based on the metadata about tables, fields, queries and FROM t1 Q1 relationships using methods of optimization of the docu- WHERE A IN ment database schema with nested or without embedded documents, form the structure of document collections. 5. Improve the structure of documents, taking into account SELECT A the graphs of information dependencies of queries, mini- FROM t2 Q2 mizing the number of data joining operations. WHERE D IN 5 Testing the Approach to Translating SELECT D Distributed Databases into Document FROM t3 Q3 Databases WHERE F%5 = 0 To test the approach to translating distributed databases into document databases, a test relational database of three tables (DB_TEST) was created, the structure of which is shown in Fig. 3 Subqueries in the original query Fig. 2 The structure of the test database 1 3 Translating a Distributed Relational Database to a Document Database 143 T3 Q1 Q2 Q3 T1 T2 T2 T1 T3 (Partition 2) T3 (Partition 1) Fig. 4 Option A: information graph of a query to a centralized rela- Shard1 tional database Shard2 Fig. 5 Database schema DB_TEST, in which table T3 is distributed This query works with three tables. If the query structure over two shards shown in Fig. 3. highlight subqueries Q1, Q2, Q3,it is pos- sible to design the information graph of the query for the case of a centralized database (Fig. 4). Select D as Q1_1 from t3 where F%5 = 0 // This is a sub- The following describes testing a query to a MongoDB query Q1 to the first part of table T 3. The first part of the database without embedded documents based on a schema table T3 is indicated on the graph: T3 (Partition 1); with sharding and replicas and in accordance with option Select D as Q1_2 from t3 where F%5 = 0 // This is a sub- 1 of the method for constructing an information query query Q1 to the second part of table T3. The second part graph. of the T3 table on the graph is designated as T3 (Parti- The first step was to define the structure of the collection tion 2); in MongoDB by the method of optimizing the document database schema without embedded documents. In order to execute query Q2 on the results of query Q1, Three tables were used as input data: T1(A, B, C), T2 (D, which consists of two parts, it is necessary to combine these A, E), T3(D, F, G). parts of the result into a single part. For this, another sub- The output data—the document database collections: query Q4 is created, which is executed in the application and corresponds to the code: V1 = {T1.B, T1.C, T2.E, T3.G} V2 = {T1.A, T2.D, T2.A, T3.D, T3.F} Select * from Q1_1 union Select * from Q1_2 Select A from t2 where D in Q4 // This is a subquery Q2 Further, there is an option for this database in a dis- to the results of query Q1 and table T2; tributed form with the presence of sharding, but without Select SQL_NO_CACHE count(*) From t1 where A in replications. Figure 5 shows the schema of the DB_TEST Q2 // This is a subquery Q3 to the results of query Q2 database, in which the T3 table is distributed over two and table T1. shards. In this case, a horizontal fragmentation scheme is used, when records from table T3 contain values for all Note: that the information graph can be different. Optimi- fields of table T 3 and are distributed among different shards zation methods for the classical algorithm represented by an according to some rule, for example, according to a hash information graph are described in [32]. key. The case of vertical fragmentation can be thought of If we add a replica for one of the tables to option C of as dividing one table into several projections in the sense a distributed relational database, for example, for table T2 of relational algebra, linked by a one-to-one relationship. (Fig. 7), then a possible information graph of the query from Therefore, this case is a special case of a centralized data- Listing 1, executed on DB_TEST, can be as in Fig. 8. base and is not considered by us when looking for a formal- In this variant, the original query is divided into subquer- ized approach to translating distributed databases from one ies, which are executed in parallel to shards, including a format to another. shard with a replica: Figure 6 shows a possible information graph for a data- base query from Listing 1, executed on a database with the Select D from t3 where F%5 = 0 // This is a query Q1 for schema in Fig. 5. This graph corresponds to the query to the the first part of table T3; distributed database without replication, but with sharding. Select D from t3 where F%5 = 0 // This is a query Q1 to The original query (Listing 1) has been split into subqueries the second part of the table T3; that can be executed in parallel on shards: 1 3 144 M. Ha, Y. Shichkina Fig. 6 Option B: information graph of a query to a distributed Q1 relational database with several T3 (Partition 1) shards, but no replications Q4 Q2 Q3 union Q1 T3 (Partition 2) T2 T1 Fig. 7 Database schema DB_TEST, in which table T2 has a replica Select A from t2 where D in Q1 // This is a query Q2 to To test these three options A, B, C, MariaDB was used the results of query Q1 on the first part of table T 3 and with the Spider storage engine, which allows you to create to table T2; distributed databases from a standard MariaDB installation Select A from t2_copy where D in Q1 // This is a query Q2 and ensures data consistency when sharded. The table T3 to the results of query Q1 on the second part of table T3 was sharded in accordance with the hash algorithm on the and to table T2; field F. Select SQL_NO_CACHE count(*) From t1 where A in Q4 // This is a Q3 query to the results of Q4 query, which combines the results of previous queries, and to table T1. 1 3 Translating a Distributed Relational Database to a Document Database 145 Fig. 8 Option C: information graph of a query to a distributed T2' relational database with several shards and replicas Q1 Q2 T3 (Partition 1) Q4 Q3 union Q1 Q2 T3 (Partition 2) T1 T2'’ Table 1 Volumes of the test database DB_TEST calculated from the moment the query was sent until the moment the results were returned. T1 T2 T3 - For variants B–C of the DB schema, the program exe- Volume 1 1000 3000 9000 cutes sub-queries to shards in parallel. Then the program Volume 2 10 000 30 000 90 000 combines the results obtained from the shards according Volume 3 100 000 300 000 900 000 to the principle of the "Union" mechanism and sends the Volume 4 1000 000 3 000 000 9 000 000 result to the input of the main query. The processing of a database query for variants B–C of the database schema can be represented as follows (Fig. 9): To test the database DB_TEST, data were generated Table  3 shows the query execution time (in seconds) in four different volumes, the sizes of which are shown in Table 1. using the developed program: In Fig. 10 shows an example of the program execution Table 2 shows the volumes of the distributed database DB_TEST in accordance with the diagram in Fig. 5. result for the case of a distributed database structure with a volume of 9,000,000 records. The table T2_copy is a Node2 replica of the table T2 from Node1. Query execution time graphs for different variants of database schemas and data volumes are shown in Fig. 11. Since MariaDB with the Spider storage engine does not support a function for making database queries to individ- Figure 11 shows that for small amounts of data, a query to all three variants of the database scheme works almost ual shards, a separate program was created for testing. This program is developed in the Python programming language the same, but with an increase in data volumes, the time difference between a sequential query and a parallel query with the multiprocessing library. The way in which the program performs a database query increases. In this case, the fastest query is performed to the database, which contains not only sharding, but also on the MariaDB database depends on the variant of the data- base structure A-C. replication. Testing the approach to translating a database with - For variant A of the DB_TEST database schema (not embedded documents, taking into account the structure of the information graph, is shown below. distributed), the execution of the program consisted in the direct execution of the query, and the execution time is Table 2 Volumes of the Node 1 Node 2 distributed database DB_TEST T1 T2 T3 T1 T2_copy T3 Volume 1 1000 3000 4500 Heт 3000 4500 Volume 2 10,000 30,000 45,000 Heт 30,000 45,000 Volume 3 100,000 300,000 450,000 Heт 300,000 450,000 Volume 4 1000,000 3000,000 4500,000 Heт 3000,000 4500,000 1 3 146 M. Ha, Y. Shichkina Execution subquery on Node 1 -Connecting to Server 1onNode1 “UNION” of Execution of Start Finish -Connecting to Server results general query 2onNode2 Execution subquery on Node 2 Fig. 9 Processing database query for variants B–C database schema Table 3 Query execution time In this testing, Option 1 or Option 2 of the information query graph method using an example of a database with Volume 1 Volume 2 Volume 3 Volume 4 embedded documents was evaluated to be faster. Option 1 0.00498 0.0651 0.708 6.37 A distributed database according to the scheme of the Option 2 0.001966 0.01898 0.207 3.655 second variant of the method for constructing an information Option 3 0.00097 0.00898 0.1027 1.351 graph of a database query is shown in Fig. 12: Note. In this case, duplication of fields was not done because the union of the results is done after the first sub- The first step was to determine the structure of the col- query is executed. Translation of subqueries Q1, Q2, Q3 for option B lection in MongoDB using the method of optimizing the schema of a document database with embedded documents. (Fig. 6) from SQL format to MongoDB format was carried out taking into account the execution plan of the relational Three tables were used as input data: T1(A, B, C), T2 (D, A, E), T3(D, F, G). query. An example of a Q1 subquery obtained as a result of translation from SQL to MongoDB format for one shard is The output data—the document database collections: shown below: V1 = {T1. B, T1.C, T2_of_T1 {T2.E, T3_of _T2 ∶ {T3.G}}}. V2 = {T1.A, T2_of_T1 ∶ {T2.A, T2.D, T3_of _T2 ∶ {T3.D, T3.F}}} Fig. 10 An example of program execution 1 3 Translating a Distributed Relational Database to a Document Database 147 Fig. 11 Query execution time for different variants of the data- base schema and data volume Fig. 12 DB schema with embedded documents 1 3 148 M. Ha, Y. Shichkina Note. Because the second subquery works with the results Taking into account the specification of the operation of the first query and with fields from external documents in of the MongoDB DBMS, the Q1, Q2, Q3 subqueries are relation to embedded documents, then it is necessary in this executed in the MongoDB query language, and the query case to duplicate all fields on both shards. to combine the results of the Q1 subqueries to the first part As in variant B, in this variant the subqueries Q1, Q2, Q3 of the T3 table and Q1 to the second part of the T3 table is were translated into the MongoDB query language. Query executed in a Python program. This is because MongoDB Q4 to combine the results of subqueries Q1, Q2 was exe- does not support join operations. The query execution plan cuted by a Python program. is shown in Fig. 13: The Python code for combining query results is: The distributed database according to the scheme of vari- The subquery codes that are executed on each shard are ant C of the method for constructing the information query shown below: graph is shown in Fig. 14: T3_1 Result T1 T2 Union User T3_2 MongoDB MongoDB + Python Fig. 13 Query execution plan using the program 1 3 Translating a Distributed Relational Database to a Document Database 149 Fig. 14 DB schema with embedded documents T2 T3_1 Result T1 Union T2 T3_2 MongoDB+ Python MongoDB Fig. 15 Query execution plan using the program 1 3 150 M. Ha, Y. Shichkina Table 4 Volumes of the test database DB_TEST: The query execution plan is shown in Fig. 15: To test queries for the DB_TEST database, data were T1 T2 T3 generated in four different volumes, the sizes of which are Volume 1 1000 3000 9000 shown in Table 4. Volume 2 10,000 30,000 90,000 Table 5 shows the volumes of the distributed database Volume 3 100,000 300,000 900,000 DB_TEST in accordance with the scheme in Fig. 5. Volume 4 1000,000 3000,000 9000,000 Option 1 is a variant of a centralized database, option 2 is a distributed database with an information query graph according to option 2 of the method for constructing an Table 5 Volumes of database data DB_TEST on relational database information query graph, option 3 is a database with an shards information graph of queries according to option 3 of a method for constructing an information query graph. Option 2 Option 3 Table 6 shows the volumes of the distributed database Node 1 Node 1 Node 2 Node 2 DB_TEST in accordance with the scheme in Fig. 5. Volume 1 1000 1000 849 849 Table  7 shows the query execution time (in seconds) Volume 2 10,000 10,000 8682 8682 using the developed program: Volume 3 10,000 10,000 87,461 87,461 Query execution time graphs for different variants of the Volume 4 1,000,000 1,000,000 874,525 874,525 database schema and data volume are shown in Fig. 16 From the diagram in Fig. 16, it can be seen that the maxi- mum parallelization of a query in the database before com- Table 6 Volumes of database DB_TEST on MongoDB database bining the result works fastest with the application of an shards external program. In our early article [32] we showed that queries to a cen- Volume Option A Option B Option C tralized database with an optimized structure using a set Node 1 Node 2 Node 1 Node 2 theory method are significantly faster than a database built Volume 1 344,7 кб 267 кб 183 кб 213,2 кб 203,9 кб manually without optimizing the structure. The results of Volume 2 3,4 мб 2,61 мб 1,8 мб 2,1 мб 2,0 мб this study show that it is possible to further speed up query Volume 3 33,7 мб 26,12 мб 18,06 мб 20,8мб 20,1мб execution if the system is distributed and the architecture of Volume 4 336,6 мб 261,3 мб 180,69 мб 208,37мб 200,71 мб the distributed system is taken into account when construct- ing queries. So in Figs. 11 and 16 it is possible to see that the centralized database even with an optimized structure Table 7 Time of execution of queries to DB_TEST (Option 1) has the longest execution time. Options 2 and 3 allow you to execute some of the subqueries in parallel and Volume 1 Volume 2 Volume 3 Volume 4 therefore work faster. Option A 0.0464 0.1423 1.297 13.445 It should also be noted that the accelerated version Option B 0.0189 0.112 1.089 11.531 requires more memory to store duplicate data. It is possi- Option C 0.0114 0.0818 0.7502 8.884 ble to see the results of testing all three options for the use of RAM and permanent memory, which are presented in Table 8. 1 3 Translating a Distributed Relational Database to a Document Database 151 Fig. 16 Query execution time for different variants of the data- base schema and data volume The table shows the amount of used memory for two sizes of data, so its execution speed is also higher. In Option 3, of the database. For the other two database volumes, the most of the data is processed at the first stage (execution of picture of used memory is similar. More clearly the use of queries Q1, Q2) in parallel. RAM is shown in Fig. 17. Figure 17 shows that the fastest Option 3 is more memory intensive than Option 2. This is due to the fact that more 6 Conclusion duplicate data are placed in memory to speed up the execu- tion of the query. Nevertheless, Option 3 is much more eco- This article has shown that when creating a distributed docu- nomical compared to Option 1. ment database, there is a need to consider: To assess which operations in the query take the long- est time with different sharding and replication options, the 1. The structure of queries, relationships between tables. test was conducted for a database with a volume of 1000 In accordance with these data, it is necessary to create 000 records. Based on the test results a diagram was built a document database model with or without embedded (Fig. 18). documents. The length of each block on the diagram corresponds to 2. The presence of shards and replicas. The presence of the execution time of separate query operation (Q1–Q4). replicas significantly speeds up the process of execut- The execution time of each operation was measured using ing queries, because allows to maximally parallelize a a specially designed program. Twenty-five tests were car - complex query based on the information graph of the ried out. The diagram shows the average results. From the query. diagram in Fig. 18 it is possible to see that in Option 3 the 3. Internal dependencies and query operations. This deter- execution time of parallel subqueries (Q1–Q2) on the shards mines the way the information graph is presented and, is longer than in Option 2. But due to the fact that most of subsequently, the structure of the document database the work has been parallelized the "join" operation (Q4) in collections. Option 3 is performed for less data and therefore executed 4. Consensus between memory size and query execution faster than in Option 2. Paralleling most of the work is not speed. Database queries in which maximum paralleli- achieved by performing more suboperations, but by process- zation is done directly by the DBMS are usually faster. ing more data in parallel, as it is possible to see from the However, in this case, it is necessary to make not a mini- RAM occupancy diagram in Fig. 17. It is in Option 3 that mum amount of memory and store duplicate records in more memory is occupied, because more data are processed. different collections on different shards. The last subquery Q3 is also applied to a smaller amount 1 3 152 M. Ha, Y. Shichkina Table 8 Results of testing memory Volume Option Permanent RAM Maximum memory RAM (Kb) (Mb) 1,000,000 Option A—centralized database 976 24.9 Option B—distributed database 1972 96.5 with sharding without replica- tion Option C—distributed database 2103 204.2 with sharding and replication 1 3 Translating a Distributed Relational Database to a Document Database 153 Fig. 17 Graph of used memory changes during query execution Option C- distributed Aggregation data from shard1 database with sharding and Aggregation data from shard2 replication Union result Execute of general query 1 2 3 45 6 78 910111213 Time(sec.) Option B- distributed database with sharding Aggregation data from shard1 without replication ag gregation data from shard2 Union result Execute of general query Time(sec.) 1 2 3 45 6 78 910111213 Option A- centralized database Time of execute query 3 6 Time(sec.) 1 2 45 78 910111213 Fig. 18 Diagram of execution of operations of a query on a database of 1000 000 records Acknowledgements Not applicable. Funding This work was supported by the Ministry of Science and Higher Education of the Russian Federation by the Agreement No 075-15-2020-933 dated 13.11.2020 on the provision of a grant in the Author Contributions Conceptualization, methodology, formal analy- form of subsidies from the federal budget for the implementation of sis, writing—review and editing, project administration were per- state support for the establishment and development of the world-class formed by YS; MH helped in software, validation, formal analysis, scientific center «Pavlov center «Integrative physiology for medicine, original draft preparation, visualization. All authors have read and high-tech healthcare, and stress-resilience technologies» agreed to the published version of the manuscript. 1 3 154 M. Ha, Y. Shichkina Data Availability Data available upon request. 11. Jia T, Zhao X, Wang Z, Gong D, Ding G (2016) Model transfor- mation and data migration from relational database to MongoDB. In: IEEE BigData, pp. 60–67 Declarations 12. Varga V, Jánosi-Rancz KT, Kálmán B (2016) Conceptual design of document NoSQL database with formal concept analysis. Acta Conflict of interest The authors declare that they have no known com- Polytech Hung 13(2):229–248 peting financial interests or personal relationships that could have ap- 13. Shaikh NF (2017) Data migration from SQL To Mongodb. Int J peared to influence the work reported in this paper. Eng Res Dev 13(11):32–36 14. Kim HJ, Ko EJ, Jeon YH, Lee KH (2018) Migration from RDBMS Open Access This article is licensed under a Creative Commons Attri- to column-oriented NoSQL: lessons learned and open problems. bution 4.0 International License, which permits use, sharing, adapta- In: Lee W, Choi W, Jung S, Song M (Eds.), Proceedings of the 7th tion, distribution and reproduction in any medium or format, as long International Conference on Emerging Databases. Lecture Notes as you give appropriate credit to the original author(s) and the source, in Electrical Engineering, vol 461. Springer, Singapore. https:// provide a link to the Creative Commons licence, and indicate if changes doi.or g/1 0.100 7/9 78-981 -1 0-652 0-0_3, https://l ink.s pring er.c om/ were made. The images or other third party material in this article are chapt er/ 10. 1007/ 978- 981- 10- 6520-0_ 3# citeas included in the article's Creative Commons licence, unless indicated 15. Lawrence R (2014) Integration and virtualization of relational otherwise in a credit line to the material. If material is not included in SQL and NoSQL systems including MySQL and MongoDB. the article's Creative Commons licence and your intended use is not In: 2014 international conference on computational science and permitted by statutory regulation or exceeds the permitted use, you will computational intelligence, Las Vegas, NV, USA, 2014, pp. 285– need to obtain permission directly from the copyright holder. To view a 290 https:// doi. org/ 10. 1109/ CSCI. 2014. 56 copy of this licence, visit http://cr eativ ecommons. or g/licen ses/ b y/4.0/ . 16. Hsu MJC, Hsu CH, Chen SC, Chung YC (2014) Correlation aware technique for SQL to NoSQL transformation. In: 7th international conference on ubi-media computing and workshops (UMEDIA). IEEE Press, Ulaanbaatar, pp. 43–46 References 17. Zhang H, Wang Y, Han J (2011) Middleware design for integrat- ing relational database and NOSQL based on data dictionary. In: 1. Ha M, Shichkina Y (2020) Creating collections without embed- Proceedings 2011 international conference on transportation, ded documents for document databases taking into account the mechanical, and electrical engineering (TMEE), Changchun, queries. Computation 8(2):45. https://doi. or g/10. 3390/ com put atio China, pp. 1469–1472 https://doi. or g/10. 1109/ TMEE. 2011. 61994 n8020 045 85 2. Shichkina Y (2015) Approaches to speed up data processing in 18. Ma K, Abraham A (2013) Toward lightweight transparent data relational databases. Procedia Comput Sci 150:131–139 middleware in support of document stores. In: 2013 third world 3. Shichkina Y, Kupriyanov M, Al-Mardi M (2016) Optimization congress on information and communication technologies (WICT algorithm for an information graph for an amount of communica- 2013), Hanoi, Vietnam, 2013, pp. 253–257 https:// doi. org/ 10. tions, Lecture Notes in Computer Science (including subseries 1109/ WICT. 2013. 71131 44 Lecture Notes in Artificial Intelligence and Lecture Notes in Bio- 19. Adriana J, Holanda M (2018) NoSQL2: SQL to NoSQL data- informatics), 9870 LNCS, c. 50-62, ISSN: 03029743 ISBN: 978- bases. In: Rocha Á, Adeli H, Reis L, Costanzo S (eds) Trends and 331946300-1, https:// doi. org/ 10. 1007/ 978-3- 319- 46301-8_5 advances in information systems and technologies. WorldCIST’18 4. Karnitis G, Arnicans G (2015) Migration of relational database to 2018. Advances in intelligent systems and computing, vol 746. document-oriented database: structure denormalization and data Springer, Cham transformation. In: 2015 7th international conference on compu- 20. QueryMongo. Query mongo: Mysql to mongodb query translator, tational intelligence, communication systems and networks, Riga, 2019. URL http:// www. query mongo. com. Latvia, pp. 113–118 https:// doi. org/ 10. 1109/ CICSyN. 2015. 30 21. Russell V (2016) sql-to-mongo-db-query-converter URL https:// 5. Toufik F, Bahaj M (2019) Model transformation from object rela-github. com/ vince ntrus sell/ sql- to- mongo- db- query- conve rter. tional database to nosql document database. In: Proceedings of the 22. Reddy R (2015) Mongodb translator - teiid 9.0 (draft), URL 2nd international conference on networking, information systems https:// docs. jboss. org/ aut hor/ displ ay/ TEIID/ Mongo DB+ Tr ans & security (niss19). Association for computing machinery, New lator York, NY, USA, Article 49, pp. 1–5 https://doi. or g/10. 1145/ 33203 23. Bernstein PA, Goodman N, Wong E, Reeve CL, Rothnie JB (1981) 26. 33203 81 Query processing in a system for distributed databases (SDD-1). 6. Al Mahruqi RS, Alalfi MH, Dean TR (2019) A semi-automated ACM Trans Database Syst 6(4):602–625. https://d oi.o rg/1 0.1 145/ framework for migrating web applications from SQL to document 319628. 319650 oriented NoSQL database. In: Proceedings of the 29th annual 24. Moritz D, Halperin D, Howe B, Heer J (2015) Perfopticon: visual international conference on computer science and software engi- query analysis for distributed databases. Comput Graph Forum. neering (CASCON '19). IBM Corp., USA, pp. 44–53https:// doi. org/ 10. 1111/ cgf. 12619 7. Stanescu L, Brezovan M, Burdescu DD (2016) Automatic map- 25. Dai J (2019) SQL to NoSQL: what to do and How. IOP Conf ping of MySQL databases to NoSQL MongoDB. In: 2016 FedC- Ser Earth Environ Sci 234:012080. https://doi. or g/10. 1088/ 1755- SIS, pp. 837–8401315/ 234/1/ 012080 8. Zhao G, Lin Q, Li L, Li Z (2014) Schema conversion model of 26. Wu S, Li F, Mehrotra S, et al. (2011) Query optimization for mas- SQL database to NoSQL. In: 2014 Ninth 3PGCIC, pp. 355–362 sively parallel data processing. In: 2nd ACM symposium on cloud 9. Freitas MC, Souza DY, Salgado AC (2016) Conceptual mappings computing. ACM, p. 12 to convert relational into NoSQL databases. In: Proceedings of the 27. Xu Y, Hu S (2013) QMapper: a tool for SQL optimization on hive 18th ICEIS using query rewriting. In: 28th international conference on world 10. Karnitis G, Arnicans G (2015) Migration of relational database to wide web companion. pp. 211–212 document-oriented database: structure denormalization and data 28. Lee C, Zheng Y (2015) Automatic SQL-to-NoSQL schema trans- transformation. In: 2015 7th ICCICSN, pp. 113–118 formation over the MySQL and HBase databases. In: IEEE inter- national conference on consumer electronics-Taiwan, pp. 426–427 1 3 Translating a Distributed Relational Database to a Document Database 155 29. https:// neo4j. com/ devel oper/ guide- impor ting- data- and- etl/ 32. Ha VM, Shichkina YA, Kostichev SV (2019) Determining the 30. Park Y, Shankar M, Park B-H, Ghosh J (2014) Graph databases composition of collections for key-document databases based on for large-scale healthcare systems: A framework for efficient data a given set of object properties and database queries. Computer management and data services. In: IEEE 30th international con- Tools in Education, no. 3, pp. 15–28. https:// doi. org/ 10. 32603/ ference on data engineering workshops. Chicago, IL, USA, pp. 2071- 2340- 2019-3- 15- 28 (in Russian) 12–19 31. De Virgilio R, Maccioni A, Torlone R (2013) Converting rela- tional to graph databases. In: First international workshop on graph data management experiences and Systems. ACM 1 3

Journal

Data Science and EngineeringSpringer Journals

Published: Jun 1, 2022

Keywords: NoSQL; Database query; Collection; Document; Database structure optimization; Embedded documents

There are no references for this article.