Algorithms
  • Article
  • Open Access

16 May 2019

An Introduction of NoSQL Databases Based on Their Categories and Application Industries †

Department of Information Management, Chaoyang University of Technology, Taichung 41349, Taiwan
*
Author to whom correspondence should be addressed.
This Paper is an Extended Version of the Conference Paper (ID: 1070) in Taichung, Taiwan, 6–8 December 2018, IS3C2018.
This article belongs to the Special Issue Selected Papers from 2018 International Symposium on Computer, Consumer and Control

Abstract

The popularization of big data requires enterprises to store ever more data. Data in an enterprise’s database must be accessed as fast as possible, but the relational database (RDB) is limited in speed by join operations. Many enterprises have therefore switched to NoSQL databases, which can meet the requirement of fast data access. However, there are more than two hundred NoSQL databases. Selecting a suitable NoSQL database for a given enterprise is important because the decision affects the performance of the enterprise’s operations. In this paper, fifteen categories of NoSQL databases are introduced to identify the characteristics of each category. Some principles and examples are proposed for choosing an appropriate NoSQL database for different industries.

1. Introduction

The relational database (RDB) has been under development since the 1970s. Supported by powerful relational database management systems (RDBMSs), the RDB is easy to use and maintain and has become a widely used kind of database []. With the popularization of big data acquisition technologies and applications, enterprises need to store more data than ever before, and they want their databases to be accessed as fast as possible. To obtain complex information from multiple relations, an RDB sometimes needs to perform SQL join operations that merge two or more relations at the same time, which can lead to performance bottlenecks. In addition, data storage formats other than the relational one, such as key-value pairs, documents, and time series, have been proposed for many applications. As a result, more and more enterprises have decided to use NoSQL databases to store big data [,,].
However, there are more than 225 NoSQL databases []. Choosing an appropriate NoSQL database for a specific enterprise is important because a change of database may affect the performance of its business operations. This paper introduces the basic concepts, compares the data formats and features, and lists some actual products for every category of NoSQL databases. It also proposes principles and key points that different types of enterprises can use to choose an appropriate NoSQL database to address their business problems and challenges.

3. The Categories of NoSQL Databases

According to the classification on the NoSQL database official website [], there are 15 categories of NoSQL databases, including wide column store, document store, key value store, and graph databases, each based on a different data model. This section explains the basic concepts of each category and analyzes the characteristics of the data that each category is suited to process.

3.1. Wide Column Store

This category of NoSQL databases has a complex table schema described as follows [,,,].
  • A row key is an identification that has a unique value used to identify a specific record, similar to the primary key of a relation in RDB.
  • A timestamp (abbreviated as ts) is an integer used to identify a specific version of a data value.
  • At least one column family that has the format of “Family: Qualifier = Value,” where “Family” is the name of a column family, “Qualifier” is the name of a column qualifier, and “Value” is a real value of a column qualifier stored in text.
  • The name of a column family needs to be defined when the table is created, but the name of a column qualifier does not.
  • Users can find the actual data value through the value of a specific row key, the name of a specific column family, the name of a specific column qualifier, and the value of a specific timestamp.
An example is illustrated as follows. An inventory table of 3C products in a wide column store database is shown in Table 3, where:
Table 3. An example of a data table in wide column store.
  • Products_Inventory is the name of the inventory table, which contains two column families, products and inventory, and has three records whose row keys are the product codes P001, P002, and P003, respectively;
  • An increasing integer ti (i = 1, 2, …, 18) is the timestamp assigned to each column qualifier when its data value is inserted into the table;
  • Column family products includes four column qualifiers: classes, title, descriptions, and price, whose data values are, for example, “TV”, “SONY 55 inch 4K OLED Smart Networked TV”, “TBD”, and “24999”, respectively;
  • Column family inventory includes two column qualifiers: quantity and place, whose data values are, for example, “10” and “1A”, respectively.
According to the statistics of the DB-Engines Ranking website [], Apache Cassandra and Apache HBase are the most widely discussed wide column store databases.
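The addressing scheme described above (row key, column family, column qualifier, timestamp) can be illustrated with a minimal in-memory sketch. This is not the API of Apache HBase or Apache Cassandra; the class `WideColumnTable`, its methods, the concrete timestamps 1 to 6, and the update at timestamp 19 are assumptions made only for illustration, while the table name, families, qualifiers, and values follow the example for record P001.

```python
class WideColumnTable:
    """Toy model of a wide column table (hypothetical, not a real client API)."""

    def __init__(self, name, column_families):
        self.name = name
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # cells[row_key][(family, qualifier)] -> {timestamp: value}
        self.cells = {}

    def put(self, row_key, family, qualifier, value, ts):
        # Qualifiers may be introduced freely, families may not.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        row = self.cells.setdefault(row_key, {})
        row.setdefault((family, qualifier), {})[ts] = value

    def get(self, row_key, family, qualifier, ts=None):
        versions = self.cells.get(row_key, {}).get((family, qualifier), {})
        if not versions:
            return None
        if ts is None:
            ts = max(versions)  # default to the newest version
        return versions.get(ts)


table = WideColumnTable("Products_Inventory", ["products", "inventory"])
# Assumed timestamps 1..6 for the six qualifiers of record P001.
table.put("P001", "products", "classes", "TV", ts=1)
table.put("P001", "products", "title", "SONY 55 inch 4K OLED Smart Networked TV", ts=2)
table.put("P001", "products", "descriptions", "TBD", ts=3)
table.put("P001", "products", "price", "24999", ts=4)
table.put("P001", "inventory", "quantity", "10", ts=5)
table.put("P001", "inventory", "place", "1A", ts=6)
# Hypothetical later update: a new version is added, the old one is kept.
table.put("P001", "inventory", "quantity", "9", ts=19)

latest_quantity = table.get("P001", "inventory", "quantity")
original_quantity = table.get("P001", "inventory", "quantity", ts=5)
```

Keeping every version under its own timestamp is what allows a reader to ask either for the newest value or for the value as of a specific insertion.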

3.2. Document Store

The terms related to the database model of the document store are described below [].
  • A collection is a group of documents. The documents within a collection are usually related to the same subject, such as employees, products, and so on.
  • A document is a set of ordered key-value pairs, where key is a string used to reference a particular value, and value can be either a string or a document.
  • JSON (JavaScript Object Notation), BSON (Binary JSON), and XML (eXtensible Markup Language) are formats commonly used to define documents.
  • Embedded documents are documents within documents. An embedded document enables users to store related data in a single document to improve database performance.
  • Document store databases do not require users to formally specify the structure of documents before adding documents to a collection, so they are described as schemaless. Application programs must therefore verify any rules about the structure of a document themselves.
An example of a collection in a document store database is shown in Figure 2. The collection, written in JSON format, stores school curriculum data. It contains three courses, “Accounting”, “Economics”, and “Computer Science”, and each course has four fields: c_no, title, credits, and instructor.
Figure 2. An example of a collection in a document store database.
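As a hedged illustration of such a collection, the following sketch parses a JSON collection with Python's standard `json` module and performs the structural check that a schemaless store leaves to the application. The course titles and field names come from the example; the course numbers, credit values, instructor names, and the embedded `instructor` document of the third course are hypothetical placeholders, as is the helper `missing_fields`.

```python
import json

# Fields that the application (not the database) expects in every course.
REQUIRED_FIELDS = {"c_no", "title", "credits", "instructor"}

# Hypothetical field values; only the titles and field names follow the text.
collection_json = """
[
  {"c_no": "C101", "title": "Accounting", "credits": 3, "instructor": "Instructor A"},
  {"c_no": "C102", "title": "Economics", "credits": 3, "instructor": "Instructor B"},
  {"c_no": "C103", "title": "Computer Science", "credits": 3,
   "instructor": {"name": "Instructor C", "office": "Room 1"}}
]
"""


def missing_fields(document):
    """Return the required fields that a document lacks."""
    return REQUIRED_FIELDS - set(document)


courses = json.loads(collection_json)
titles = [doc["title"] for doc in courses]
invalid = [doc for doc in courses if missing_fields(doc)]
# The third course stores its instructor as an embedded document.
embedded_name = courses[2]["instructor"]["name"]
```

The first two courses store the instructor as a string and the third as an embedded document, which the store accepts without complaint; any stricter rule would have to be added to `missing_fields` or a similar application-side check.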
According to the statistics of the DB-Engines Ranking website [], MongoDB and Couchbase Server are the most widely discussed document store databases.

3.3. Key Value Store

The data in this category of NoSQL databases is stored with the format of “Key → Value” [], where
  • Key is a string used to identify a unique value;
  • Value is an object whose value can be a simple string, numeric value, or a complex BLOB (binary large object), JSON object, image, audio, and so on;
  • In key value store databases, operations on values are derived from keys. Users can retrieve, set, and delete a value by a key;
  • A namespace is a logical data structure that can contain any number of key-value pairs.
Suppose that an online shopping website uses a key value store database to store data as shown in Figure 3. This database includes several namespaces, such as “Products” and “Customers” [], where
Figure 3. An example of two namespaces in a key value store database.
  • The key in the namespace “Products” is the ID of products, and the value is the details about products;
  • The key in the namespace “Customers” is the ID of customers, and the value is the details about customers.
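The retrieve, set, and delete operations and the role of namespaces can be sketched with nested dictionaries. The class `KeyValueStore` and the stored product and customer details are hypothetical and do not reproduce the interface of Redis or DynamoDB.

```python
class KeyValueStore:
    """Toy key value store with namespaces (hypothetical illustration)."""

    def __init__(self):
        # namespace name -> {key: value}
        self._namespaces = {}

    def set(self, namespace, key, value):
        self._namespaces.setdefault(namespace, {})[key] = value

    def get(self, namespace, key, default=None):
        return self._namespaces.get(namespace, {}).get(key, default)

    def delete(self, namespace, key):
        """Remove a key and report whether it existed."""
        pairs = self._namespaces.get(namespace, {})
        if key in pairs:
            del pairs[key]
            return True
        return False


store = KeyValueStore()
# The same key "001" can be used in both namespaces without a clash.
store.set("Products", "001", {"title": "TV", "price": 24999})
store.set("Customers", "001", {"name": "Customer A", "city": "Taichung"})

product = store.get("Products", "001")
customer = store.get("Customers", "001")
deleted = store.delete("Customers", "001")
after_delete = store.get("Customers", "001")
```

Every operation goes through the key; the store never inspects the value, which is why the value can be any object.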
According to the statistics of the DB-Engines Ranking website [], Redis and DynamoDB are the most widely discussed key value store databases.

3.4. Graph Databases

The graph database model (GDM) is composed of vertices and edges [], where
  • A vertex is an entity instance, which is equivalent to a tuple in the relational database model (RDM);
  • An edge is used to define the relationship between vertices;
  • Each vertex and edge contains any number of attributes that store the actual data value.
Consider an airline in Oceania as an example. The airline needs to store the flight hours between certain cities. The data can be stored in a graph database as shown in Figure 4. In this graph database, each vertex contains data such as nation, city, and A2C_time (the time from an airport to a city center), and each edge represents the flight duration between two cities [].
Figure 4. An example of data stored in a graph database.
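A minimal sketch of this vertex and edge structure follows. The `Graph` class, the two cities, and all durations (in minutes) are hypothetical placeholders rather than the contents of Figure 4; only the attribute names nation, city, and A2C_time follow the text.

```python
class Graph:
    """Toy property graph: vertices and edges both carry attributes."""

    def __init__(self):
        self.vertices = {}  # vertex id -> attribute dict
        self.edges = {}     # frozenset({a, b}) -> attribute dict

    def add_vertex(self, vertex_id, **attributes):
        self.vertices[vertex_id] = attributes

    def add_edge(self, a, b, **attributes):
        # Undirected edge: the pair is stored without an order.
        self.edges[frozenset((a, b))] = attributes

    def edge(self, a, b):
        return self.edges[frozenset((a, b))]


def door_to_door_minutes(graph, origin, destination):
    """City center to city center: two airport transfers plus the flight."""
    flight = graph.edge(origin, destination)["flight_minutes"]
    return (graph.vertices[origin]["A2C_time"]
            + flight
            + graph.vertices[destination]["A2C_time"])


g = Graph()
# Hypothetical values, chosen only to make the arithmetic easy to follow.
g.add_vertex("SYD", nation="Australia", city="Sydney", A2C_time=30)
g.add_vertex("AKL", nation="New Zealand", city="Auckland", A2C_time=40)
g.add_edge("SYD", "AKL", flight_minutes=180)

total = door_to_door_minutes(g, "SYD", "AKL")  # 30 + 180 + 40
```

Because the relationship is stored as an edge, a question that combines vertex and edge attributes is answered by following the edge rather than by joining tables.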
According to the statistics of the DB-Engines Ranking website [], Neo4j and FlockDB are the most widely discussed graph databases.

3.5. Multimodel Databases

This category of NoSQL databases combines the data formats of two or more other categories of NoSQL databases []. According to the statistics of the DB-Engines Ranking website [], OrientDB and ArangoDB are the most widely discussed multimodel databases. OrientDB supports the data formats of object database, document store, graph database, and key value store, while ArangoDB supports the data formats of document store, graph database, and key value store [].

3.6. Object Databases

This category of NoSQL databases combines the functions of object-oriented programming languages and traditional databases []. Consider, as an example, a web-based application system that lets users order lunch boxes. The data in the object database are described in the form of a class diagram, as shown in Figure 5 []. In Figure 5, each rectangle is an object that includes both data items and data processing functions. For example, the object Customers has four data items (account, password, telephone, and e-mail) and two data processing functions (readData() and writeData()). According to the statistics of the DB-Engines Ranking website [], db4o and Versant are the most widely discussed object databases.
Figure 5. An example of a class diagram in an object database.
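The Customers object can be sketched as a Python class that bundles the four data items with its two data processing functions. The methods `write_data` and `read_data` mirror writeData() and readData(); their signatures, the in-memory `storage` dictionary that stands in for the database, and the sample values are assumptions for illustration, not the behavior of db4o or Versant.

```python
import copy
from dataclasses import dataclass


@dataclass
class Customers:
    account: str
    password: str
    telephone: str
    email: str

    def write_data(self, storage):
        """Store a snapshot of this object under its account name."""
        storage[self.account] = copy.deepcopy(self)

    @staticmethod
    def read_data(storage, account):
        """Return a copy of the stored object, still a Customers instance."""
        return copy.deepcopy(storage[account])


storage = {}  # hypothetical stand-in for the object database
customer = Customers("amy01", "secret", "04-0000-0000", "amy@example.com")
customer.write_data(storage)

loaded = Customers.read_data(storage, "amy01")
```

The object comes back as an instance of the same class, with its methods attached, so no mapping between rows and objects is needed.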

3.7. Grid and Cloud Database Solutions

This category of NoSQL databases keeps recently accessed data in random access memory (RAM) and uses grid computing to speed up data access []. According to the statistics of the DB-Engines Ranking website [], Hazelcast and Oracle Coherence are the most widely discussed grid and cloud database solutions.

3.8. XML Databases

The files stored in this category of NoSQL databases are based on the XML format []. An example of a school curriculum file stored in an XML database is shown in Figure 6. This XML file contains three courses, internet of things, artificial neural network, and big data, whose course numbers (c_no) are C001, C002, and C003, whose credits are 3, 4, and 2, and whose instructors are Amy, Zoe, and Mary, respectively. According to the statistics of the DB-Engines Ranking website [], Oracle Berkeley DB and BaseX are the most widely discussed XML databases.
Figure 6. An example of a data file in an XML database.
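The curriculum data described above can be written and queried as XML with Python's standard `xml.etree.ElementTree` module. The course values are those given in the text; the element names `curriculum`, `course`, and `title` and the overall layout are assumptions, since the exact markup of Figure 6 is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Assumed layout: one <course> element per course, one child per field.
xml_text = """<curriculum>
  <course>
    <c_no>C001</c_no>
    <title>Internet of Things</title>
    <credits>3</credits>
    <instructor>Amy</instructor>
  </course>
  <course>
    <c_no>C002</c_no>
    <title>Artificial Neural Network</title>
    <credits>4</credits>
    <instructor>Zoe</instructor>
  </course>
  <course>
    <c_no>C003</c_no>
    <title>Big Data</title>
    <credits>2</credits>
    <instructor>Mary</instructor>
  </course>
</curriculum>"""

root = ET.fromstring(xml_text)
courses = root.findall("course")

# Sum the credits of all courses: 3 + 4 + 2.
total_credits = sum(int(course.findtext("credits")) for course in courses)

# Look up the instructor of each course by title.
instructor_by_title = {
    course.findtext("title"): course.findtext("instructor") for course in courses
}
```

An XML database would typically answer such questions with a query language such as XPath or XQuery; the `findall` and `findtext` calls play that role in this sketch.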

3.9. Multidimensional Databases

The data in this category of NoSQL databases is stored in a multidimensional array in order to analyze the value of each array element. Suppose a printing company stores data in a multidimensional database as shown in Figure 7 []. The printing company needs to analyze the total sales amount of printed products based on three dimensions: products, branches, and customer rank. For example, the company has two branches (Taipei and Tainan), three products (copy paper, photo paper, and poster), and two customer ranks (platinum member and normal member). The owner of the printing company wants the total sales amount for each branch, each product, and each customer rank. According to the statistics of the DB-Engines Ranking website [], InterSystems Caché and GT.M are the most widely discussed multidimensional databases.
Figure 7. An example of a three-dimensional array in a multidimensional database.
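The three-dimensional array and the per-dimension totals can be sketched with nested lists. The dimension labels follow the example; the sales amounts are hypothetical placeholders, and `totals_by` is an illustrative helper rather than a feature of any particular product.

```python
branches = ["Taipei", "Tainan"]
products = ["copy paper", "photo paper", "poster"]
ranks = ["platinum member", "normal member"]

# sales[branch][product][rank]; the amounts are hypothetical.
sales = [
    [[10, 20], [30, 40], [50, 60]],  # Taipei
    [[1, 2], [3, 4], [5, 6]],        # Tainan
]


def totals_by(axis):
    """Sum the array over the other two dimensions.

    axis 0 groups by branch, 1 by product, 2 by customer rank.
    """
    labels = (branches, products, ranks)[axis]
    totals = {label: 0 for label in labels}
    for b, branch in enumerate(branches):
        for p, product in enumerate(products):
            for r, rank in enumerate(ranks):
                key = (branch, product, rank)[axis]
                totals[key] += sales[b][p][r]
    return totals


by_branch = totals_by(0)
by_product = totals_by(1)
by_rank = totals_by(2)
```

Each total is obtained by fixing one dimension and summing over the other two, which is the kind of aggregation the company's owner asks for.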

3.10. Multivalue Databases

This category of NoSQL databases is suitable for storing data with multivalued attributes or composite attributes []. An example of student data in a table of a multivalue database is shown in Table 4. The schema of the table is students (SID, name, society), where name is a composite attribute composed of two attributes, First_name and Last_name, and society is a multivalued attribute. The table contains six records; the name of each student is split into two parts stored in the attributes First_name and Last_name, respectively, and each student can belong to more than one society. According to the statistics of the DB-Engines Ranking website [], jBASE and Model 204 Database are the most widely discussed multivalue databases.
Table 4. An example of a data table in a multivalue database.
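The students schema can be sketched with one record per student, a nested mapping for the composite attribute name, and a list for the multivalued attribute society. The three records and the helper `members_of` are hypothetical and do not reproduce the six records of Table 4.

```python
# One record per student; the values are hypothetical.
students = [
    {"SID": "S001",
     "name": {"First_name": "Amy", "Last_name": "Lin"},
     "society": ["Chess", "Music"]},
    {"SID": "S002",
     "name": {"First_name": "Bob", "Last_name": "Chen"},
     "society": ["Music"]},
    {"SID": "S003",
     "name": {"First_name": "Zoe", "Last_name": "Wang"},
     "society": ["Chess", "Drama", "Music"]},
]


def members_of(society):
    """Return the SIDs of the students who belong to a society."""
    return [s["SID"] for s in students if society in s["society"]]


chess_members = members_of("Chess")

# A first-normal-form relation would need one row per (student, society) pair.
flattened_rows = [(s["SID"], soc) for s in students for soc in s["society"]]
```

The three records expand to six rows when flattened, which shows what the multivalue model saves by keeping all of a student's societies in a single record.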

3.11. Event Sourcing

This category of NoSQL databases is suitable for storing events that occurred in the past in order to track the status of a specific event. An example of a lecture registration system that stores its data in an event sourcing database is shown in Table 5. In this table, the first two fields, time and person, can together be considered an event, and the last field, current enrolment number, tracks the number of people currently enrolled in the lecture []. According to the statistics of the DB-Engines Ranking website [], Event Store is the most widely discussed event sourcing database.
Table 5. An example of a data table in an event sourcing database.
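The idea that the current enrolment number is derived from the stored events can be sketched by replaying an event log. The fields time and person follow the description of Table 5; the `action` field (with the values "register" and "cancel"), the sample events, and the function `replay` are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: str    # ISO 8601 text, so string order equals time order
    person: str
    action: str = "register"  # assumed field: "register" or "cancel"


def replay(events):
    """Rebuild the enrolment number after each event, in time order."""
    enrolled = set()
    history = []
    for event in sorted(events, key=lambda e: e.time):
        if event.action == "register":
            enrolled.add(event.person)
        elif event.action == "cancel":
            enrolled.discard(event.person)
        history.append((event.time, event.person, len(enrolled)))
    return history


# Hypothetical event log; events are only appended, never updated.
log = [
    Event("2018-12-01T09:00", "Amy"),
    Event("2018-12-01T09:05", "Bob"),
    Event("2018-12-01T09:10", "Amy", "cancel"),
    Event("2018-12-01T09:20", "Carol"),
]

history = replay(log)
current_enrolment = history[-1][2]
```

Because the log is never overwritten, the enrolment number at any earlier moment can be recovered by replaying only the events up to that time.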

3.12. Time Series Databases (TSDBs)

This category of NoSQL databases is designed to handle time series data [,]. An example of air quality data is illustrated as follows. Assume that an observing station measures the air quality index (AQI) and the density of PM2.5 once an hour and transmits each measurement to a time series database (TSDB); the results for 2018 are shown in Table 6 []. According to the statistics of the DB-Engines Ranking website [], Informix Time Series Solution and influxdata are the most widely discussed TSDBs.
Table 6. An example of a data table in a time series database (TSDB).
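Hourly measurements of this kind can be sketched as a list of points ordered by time, with a range query answered by binary search. The four measurements are hypothetical placeholders, not the values of Table 6, and `query` and `mean_aqi` are illustrative helpers rather than the interface of any TSDB product.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# (timestamp, AQI, PM2.5) points in time order; the values are hypothetical.
series = [
    (datetime(2018, 1, 1, 0), 40, 12.0),
    (datetime(2018, 1, 1, 1), 45, 14.0),
    (datetime(2018, 1, 1, 2), 55, 18.0),
    (datetime(2018, 1, 1, 3), 50, 16.0),
]


def query(start, end):
    """Return all points with start <= timestamp <= end."""
    times = [t for t, _, _ in series]
    lo = bisect_left(times, start)
    hi = bisect_right(times, end)
    return series[lo:hi]


def mean_aqi(points):
    return sum(aqi for _, aqi, _ in points) / len(points)


window = query(datetime(2018, 1, 1, 1), datetime(2018, 1, 1, 3))
average = mean_aqi(window)  # (45 + 55 + 50) / 3
```

Since the points arrive in time order, a time window maps to one contiguous slice of the data, which is the access pattern that a TSDB is built to serve.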

3.13. Scientific and Specialized DBs

This category of NoSQL databases is designed to solve scientific and professional problems. For example, BayesDB allows users without statistical training to solve basic science problems, and GPUdb is a database suitable for distributed computing [].

3.14. Other NoSQL Related Databases

The NoSQL databases in this category appear to fit several of the categories described earlier, but the NoSQL database official website [] places them in this special category without explaining its characteristics. We therefore cannot determine why this category is needed or why these NoSQL databases are assigned to it. According to the statistics of the DB-Engines Ranking website [], eXtremeDB is the most widely discussed of the other NoSQL related databases.

3.15. Unresolved and Uncategorized

A NoSQL database is assigned to this category if it cannot be classified into any of the previously mentioned categories. According to the statistics of the DB-Engines Ranking website [], Adabas and CodernityDB are the most widely discussed unresolved and uncategorized databases. The characteristics of this category and of the other NoSQL related databases category are similar. Both categories follow the classification of the NoSQL database official website [], which does not state the basis of the classification.

3.16. Summary

The basic concepts of each category of NoSQL databases have been described above. Analyzing all the categories shows that each one is suited to processing data with certain features. The results are summarized in Table 7.
Table 7. Summary of suitable data features for NoSQL databases.

4. Choose an Appropriate Database

4.1. The Principles of Database Selection

If an enterprise is preparing to choose a NoSQL database, it should work through the following points in light of its own culture and characteristics.
  • Understand the current problems, goals, and challenges of the enterprise’s operational database.
  • The engineers of the IT center or the database administrators (DBAs) must decide whether to keep using the current RDB or to switch to a NoSQL database, based on the needs of the enterprise and their own expertise.
  • If the enterprise switches to a NoSQL database, the IT engineers or DBAs first select a suitable category of NoSQL databases based on the features and formats of the enterprise’s operating data.
  • When deciding which NoSQL database to choose, the IT engineers or DBAs can decide according to the needs of the enterprise, the characteristics of each database, and the reputation and popularity of each database on websites (for example, the DB-Engines Ranking website [] and vsChart []). The more websites are consulted, the more reliable the picture of each database’s reputation and popularity, and the more likely it is that the right NoSQL database will be found.

4.2. Database Selection Case 1

Suppose that a 3C shopping website has used an RDB to store data for a long time and that this RDB accumulates 300,000 new records per day. Users report that the website has become slow and want data to be processed as fast as possible, so the business owner asks the information department to solve the problem.
Following the owner’s instructions, the head of the information department investigated and found two causes of the slower data access: the large amount of data generated every day, and the need for many user queries to join several large RDB tables. The head therefore recommends a NoSQL database as the solution, because the data from several RDB tables can be merged in advance and stored together in a NoSQL database, so that queries can read the desired data quickly without waiting for join operations.
After the business owner agrees, the head of the information department decides which NoSQL database to use. The decision process is as follows.
  • The most suitable category of NoSQL database is the wide column store because access to the database often requires searching for data in a specific field.
  • According to the DB-Engines Ranking website [], the wide column store databases most commonly discussed on the internet are Apache Cassandra and Apache HBase.
  • According to the experimental results of Chen et al. [], Apache HBase takes less time to read data than Apache Cassandra. Therefore, Apache HBase is recommended as the NoSQL database for the enterprise.

4.3. Database Selection Case 2

To understand the news reporting strategies of its peers, a famous newspaper must collect online news from various newspapers and media, so tens of thousands of online news items are stored for analysis every day. The information staff found that the RDB could not provide timely access to such a large amount of data and therefore recommended using a NoSQL database to solve the problem of slow access.
To address this problem, the head of the information department must decide which NoSQL database to use for the newspaper company. The decision-making process is as follows.
  • Since the newspaper needs to collect a large number of rapidly generated files every day, such as tens of thousands of online news items and the related readers’ messages, it is necessary to replace the RDB with a NoSQL database.
  • Of the fifteen categories of NoSQL databases, the one found suitable for storing news multimedia materials is the document store.
  • According to the DB-Engines Ranking website [], the two document store databases most often discussed on the internet are MongoDB and Couchbase Server. Since the former has a higher market share than the latter, MongoDB is recommended as the NoSQL database for the company.

4.4. Database Selection Case 3

An American multinational retail enterprise faces the following challenges with its website [].
  • This corporation intends to optimize its online recommendation system so that suitable products are recommended to each customer who visits its website.
  • To achieve this goal, the database must join large amounts of customer and product data quickly so that customers’ needs and product trends can be analyzed in time.
  • However, the RDB that the company uses cannot meet this requirement.
For these reasons, the head of the information department in this corporation decides to replace the RDB with a NoSQL database. The decision process is as follows.
  • The most suitable category of NoSQL database for the enterprise is graph databases, because graph databases are the most suitable for recommendation systems, as described in Table 7.
  • According to the statistics of the DB-Engines Ranking website [], the most discussed graph databases are Neo4j and FlockDB.
  • Since Neo4j has the largest market share among all graph databases [], Neo4j is recommended as the NoSQL database for this enterprise.
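The decision process shared by the three cases (first select a category from the features of the data, then select a product within that category) can be sketched as two lookups. The category mapping and the candidate products are taken from the three cases; the feature labels, the function `select_database`, and the 0/1 preference score are hypothetical and only encode which candidate each case preferred, not real benchmark or ranking figures.

```python
# Step 1: data feature -> category (labels paraphrase the three cases).
CATEGORY_BY_DATA_FEATURE = {
    "related transaction data": "wide column store",
    "semi-structured news materials": "document store",
    "recommendation system": "graph databases",
}

# Step 2: candidate products named for each category in the cases.
CANDIDATES = {
    "wide column store": ["Apache Cassandra", "Apache HBase"],
    "document store": ["MongoDB", "Couchbase Server"],
    "graph databases": ["Neo4j", "FlockDB"],
}

# Placeholder preference: 1 for the product each case ends up recommending.
# A real selection would plug in read times, market share, or rankings here.
PREFERRED = {"Apache HBase", "MongoDB", "Neo4j"}


def score(product):
    return 1 if product in PREFERRED else 0


def select_database(data_feature, score_fn=score):
    """Return (category, product) for a given data feature."""
    category = CATEGORY_BY_DATA_FEATURE[data_feature]
    product = max(CANDIDATES[category], key=score_fn)
    return category, product


case1 = select_database("related transaction data")
case2 = select_database("semi-structured news materials")
case3 = select_database("recommendation system")
```

Passing the scoring function as a parameter reflects that the cases rank candidates by different evidence: read time in Case 1 and market share in Cases 2 and 3.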

5. Conclusions

The main contents of this paper are as follows. First, we introduce the basic characteristics of the fifteen categories of NoSQL databases (such as wide column store, document store, key value store, and graph databases) listed on the NoSQL database official website []. Then we analyze the characteristics of the data that each category is suited to process. Next, we propose principles and key points to help enterprises find an appropriate NoSQL database among more than 225 candidates when they intend to move from an RDB to a NoSQL database. Finally, we present three cases, a 3C shopping website, a newspaper, and a US retailer, to demonstrate how a particular company can choose a suitable NoSQL database to improve its competitiveness and customer service.
In summary, a company that moves from an RDB to a NoSQL database needs to consider the characteristics of its data in order to find the right database. The transaction data of the e-commerce industry often needs to be related; the suitable NoSQL database category is the wide column store, and Apache HBase is a good choice. The news materials of the news industry are semi-structured; the suitable category is the document store, and MongoDB is the better choice. Retail data needs to be used by a recommendation system, so the suitable category is graph databases, and Neo4j is the best choice. We hope that these principles and examples will help decision makers change databases correctly.

Author Contributions

Conceptualization, J.-K.C.; methodology, J.-K.C. and W.-Z.L.; writing—original draft preparation, W.-Z.L.; writing—review and editing, J.-K.C.; supervision, J.-K.C.; project administration, J.-K.C.

Funding

This research received no external funding.

Acknowledgments

We thank the reviewers for their many valuable comments, which made this paper more complete.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, H.A. Database System: Concept, Design, and Implementation, 3rd ed.; XBOOK MARKETING Co., Ltd.: Taipei, Taiwan, 2013. (In Chinese)
  2. NoSQL Databases. Available online: http://nosql-database.org/ (accessed on 20 January 2019).
  3. Pi, S.J. Establish the Cornerstone of Big Data: NoSQL Database Technique, 2nd ed.; TopTeam Information Co., Ltd.: Taipei, Taiwan, 2016. (In Chinese)
  4. Lu, J.H. Challenge Big Data, How to Process Big Data in Facebook, Google, Amazon? Use NoSQL to Get 10 Billion Annual Hard Disk Data, 2nd ed.; TopTeam Information Co., Ltd.: Taipei, Taiwan, 2015. (In Chinese)
  5. Sullivan, D. NoSQL for Mere Mortals, 1st ed.; Pearson P T R: London, UK, 2015.
  6. Hecht, R.; Jablonski, S. NoSQL Evaluation: A Use Case Oriented Survey. In Proceedings of the 2011 International Conference on Cloud and Service Computing, Hong Kong, China, 12–14 December 2011.
  7. Lourenço, J.R.; Cabral, B.; Carreiro, P.; Vieira, M.; Bernardino, J. Choosing the right NoSQL database for the job: A quality attribute evaluation. J. Big Data 2015, 2, 18:1–18:26.
  8. Corbellini, A.; Mateos, C.; Zunino, A.; Godoy, D.; Schiaffino, S. Persisting big-data: The NoSQL landscape. Inf. Syst. 2016, 63, 1–23.
  9. Khazaei, H.; Fokaefs, M.; Zareian, S.; Beigi-Mohammadi, N.; Ramprasad, B.; Shtern, M.; Gaikwad, P.; Litoiu, M. How do I Choose the Right NoSQL Solution? A Comprehensive Theoretical and Experimental Survey. Big Data Inf. Anal. 2016, 1, 185–216.
  10. Gessert, F.; Wingerath, W.; Friedrich, S.; Ritter, N. NoSQL database systems: A survey and decision guidance. Softw.-Intensiv. Cyber-Phys. Syst. 2017, 32, 353–365.
  11. Davoudian, A.; Chen, L.; Liu, M. A Survey on NoSQL Stores. ACM Comput. Surv. (CSUR) 2018, 51, 40:1–40:43.
  12. Dimiduk, N.; Khurana, A. HBase in Action, 1st ed.; O’Reilly & Associates Inc.: New York, NY, USA, 2012.
  13. Lu, J.H. Hadoop: Practical Technical Handbook, 2nd ed.; TopTeam Information Co., Ltd.: Taipei, Taiwan, 2014. (In Chinese)
  14. George, L. HBase: The Definitive Guide, 1st ed.; O’Reilly & Associates Inc.: New York, NY, USA, 2011.
  15. DB-Engines Ranking. Available online: https://db-engines.com/en/ranking (accessed on 4 March 2018).
  16. Multi-Model Databases (Wikipedia). Available online: https://en.wikipedia.org/wiki/Multi-model_database (accessed on 15 June 2018).
  17. Wu, R.H. Object-Oriented System Analysis and Design: An MDA Approach with UML, 4th ed.; BestWise Co., Ltd.: Taipei, Taiwan, 2013. (In Chinese)
  18. Document-Oriented Database (Wikipedia). Available online: https://en.wikipedia.org/wiki/Document-oriented_database (accessed on 15 June 2018).
  19. Multidimensional Databases. Available online: https://docs.oracle.com/cd/E12478_01/rpas/pdf/150/html/classic_client_user_guide/basic_rpas_concepts/multidimensional_databases.htm (accessed on 5 May 2018).
  20. MultiValue (Wikipedia). Available online: https://en.wikipedia.org/wiki/MultiValue (accessed on 15 June 2018).
  21. Introducing to Event Sourcing. Available online: https://msdn.microsoft.com/en-us/library/jj591559.aspx#sec1 (accessed on 16 January 2018).
  22. Time Series Database (Wikipedia). Available online: https://en.wikipedia.org/wiki/Time_series_database (accessed on 16 January 2018).
  23. Time Series (Wikipedia). Available online: https://en.wikipedia.org/wiki/Time_series (accessed on 16 January 2018).
  24. Central Weather Bureau. Available online: https://www.cwb.gov.tw/eng/index.htm (accessed on 10 July 2018).
  25. vsChart.com: The Comparison Wiki: Database List. Available online: http://vschart.com/list/database/ (accessed on 18 February 2019).
  26. Chen, C.Y.; Chang, B.R.; Tsai, H.F.; Guo, C.L. Empirical Analysis of High Efficient Remote Cloud Data Center Backup Using HBase and Cassandra. Sci. Progr. 2014, 2015, 1–10.
  27. Neo4j: Walmart Case Study. Available online: https://neo4j.com/case-studies/walmart/ (accessed on 10 December 2018).
