A short time ago a colleague asked me where he could find a “Databases 101” guide for the non-technical professional. As it turns out, the internet is littered with information, and mis-information, regarding data and databases. This makes it difficult for someone entering the field of data and databases to understand the bigger picture. He wanted something to help make sense of data and databases, specifically relational versus NoSQL.
So, that’s what I decided to work on. This is the start of my Databases 101 guide for the non-database professional; something to help anyone understand why the word “database” is an overloaded term. We use the term “database” to describe a great many things. To show how far down the rabbit hole we have gone, all you need are three words: “Microsoft Access Database”.
[Microsoft Access is not a database. And Microsoft Access is a database. It’s Schrödinger’s database, really. It’s likely to be the database of choice for the quantum computers once they are online. But I digress.]
This post is organized into the following sections:
Relational Databases
Non-relational Databases
CAP Theorem
Key-Value Databases
Document Databases
Graph Databases
Additional Factors to Consider
ACID vs BASE
Availability
Disaster Recovery
Performance
Complementary technologies
ETL
Infrastructure options
Vendors
Monitoring and Alerting
Business support
First, we will look at the different types of databases available today. Then we will discuss ways for you to choose the right database for your requirements. Finally, I list out some additional factors to consider when trying to decide on any specific database technology.
Also warrants mentioning: I use the terms database and database engine interchangeably in this article.
Let’s get started.
Types of Databases
To me, databases only have two classifications: relational databases, and everything else. I don’t care what anyone else tells you. Those are the two buckets you start with. Either your database engine is relational, or not.
The DB-Engines website is a good resource to understand the myriad of database options that exist. I’ve written before about this ranking and why it is good, but not perfect:
“I’ve been following the DB-Engines ranking for a few years and would encourage you to do the same. The rankings are not an exact science. You can read for yourself how they are calculated. One thing to note here is the selection bias in how they collect their data. The rankings show a clear preference for systems that have a lot of engagement online. It doesn’t talk about revenue, the number of installations, or if the engagement online is negative or positive.”
I will use data from DB-Engines to break things down into more detail.
Relational databases
As of July 1st, 2018, there are 138 relational databases (out of 343 total) listed at DB-Engines. Relational is by far the largest category. The top four ranked databases at DB-Engines are relational engines, and six of the top ten overall. Yes, Microsoft Access is there, currently ninth. I’ll let that sink in for a moment.
Historically, what defined a database as relational (in a traditional sense) was the ability for database transactions to adhere to the ACID properties. ACID stands for Atomicity, Consistency, Isolation, and Durability. You can read more about these concepts and other database transaction concepts in this paper written by Jim Gray in 1981.
This is not to say that only relational databases support ACID, but you will find that the majority of differences between relational and non-relational databases are ACID versus BASE (which we will discuss next).
Relational databases leverage ACID for database transactions in order to maintain data quality and integrity. And, to some computer folk, this is a weakness in the design of a relational engine. Favoring data quality and integrity comes at a cost. The costs are performance, latency, storage, and throughput.
The truth is that relational databases scale up just fine when placed in the hands of people that (1) know what they are doing, (2) aren’t afraid to learn new things, and (3) don’t mind doing the work necessary.
I said scale up, which is another way of saying “throw hardware at the problem”. Scaling out means you add more nodes to handle the workload. Scaling up versus scaling out is a traditional difference between relational and non-relational database engines. Often when you want more performance from a relational engine you will scale up. But for non-relational systems it is easier to scale out. Scaling out is often cheaper and easier than scaling up.
Non-relational databases
In contrast to relational database engines years ago a handful of systems arose that marketed themselves as “No SQL”, as an alternative to relational database engines. Over time that stance softened, and instead of the anti-SQL stance they became friendlier by labeling themselves as “Not Only SQL”, or NoSQL. This is the group I classify as non-relational. DB-Engines has 205 systems listed as non-relational, but with many sub-categories.
The most popular NoSQL system types are key-value stores (66), document stores (46), graph databases (31), time-series databases (25), and search engines (17). There are also 36 NoSQL databases listed as “multi-model”, meaning they are more than one non-relational engine. An example of a multi-model engine is Cosmos DB, as it supports document, graph, key-value, table, and column-family data models.
As stated earlier, NoSQL solutions favor performance over data quality through the use of BASE. BASE stands for Basic Availability, Soft-state, and Eventual consistency. If I had to sum up the differences between relational and non-relational database engines it would be this:
Relational engines require consistency at the end of each transaction, whereas non-relational engines only require that the database be consistent at some point in time (i.e., eventually).
This is not a hard line though. You can find plenty of counter examples, where a relational engine allows for eventual consistency. Or a NoSQL engine that embraces ACID. Over the decades, software providers have been baking lots of functionality into their products. This page for SQL Server at DB-Engines lists ‘secondary engines’ of document, graph, and key-value. That’s a lot of functionality inside a traditional relational engine.
I’m offering a way to help you understand the differences at a high level, so you can make an informed decision about what database engine would best suit your specific requirements.
Choosing the Right Database
This is a natural question that many will ask: Which database do I want? Time is spent trying to determine which database is the right database. The truth is that your application systems are complex, and when broken down into pieces you will find that different pieces are best served by different database engines. Netflix is a great example of this, as they use a handful of database engines in order to process the billions of data points they collect and analyze daily.
So, set aside the idea that you are going to find the one true database that is better than everything else. Start thinking about how you can find the right database for different pieces of your complex systems.
Now, let’s look at how you can make an informed decision as to the database engine that will work best for your needs.
CAP Theorem
CAP stands for Consistency, Availability, and Partition tolerance. The theorem states that you cannot have all three, as there are natural tradeoffs between the items. Similar to “fast, cheap, and easy, pick two”, that’s also what people say about CAP: “consistency, availability, or partition tolerance, choose two.”
Consistency means that any read request will return the most recent write. But as we have discussed earlier, consistency need not be immediate. This graph helps explain the different consistency levels:
Availability means that a non-responding node must respond in a reasonable amount of time. This is different than partition tolerance, which states the system will continue to operate despite network or node failures.
When you put these three items together you should start to understand why you can only ever prioritize two of them at a time. For example, if your system needs to be available and partition tolerant, then you must be willing to accept some latency in your consistency requirements.
Here’s a quick visualization of the CAP theorem:
Let’s look at each side the triangle above (labeled CA, AP, CP). Traditional relational databases are a natural fit for the CA side. That is, they can feature strong consistency and be highly available, but often at the expense of partition tolerance.
Non-relational database engines are meant to satisfy AP and CP requirements. This is why we have so many flavors of NoSQL: key-value, document, and graph, to name a few. We have two sides of the triangle supported by NoSQL database engines. That means we have a large variety of availability and partitioning requirements along with specific query and workload requirements. It is easy to understand why we have so many NoSQL solutions available.
As a very general classification you could think of things this way: relational databases are optimized for writes, and NoSQL databases are optimized for reads. That’s not an absolute for each engine, just a generalization to understand why the engines were originally built.
Key-value
Key-value databases work best for scenarios where you have a simple schema, you are doing many read/writes and few updates, you need something that can scale across many nodes and perform well, and you don’t have complex queries that require joins. Examples of key-value databases are Redis, DynamoDB, and Cosmos DB. An example of a company using key-value to achieve high performance is Snapchat, who moved their Snapchat Stories feature to DynamoDB.
Document
Document databases work best for scenarios where you need a flexible schema, your data is in XML or JSON format, you have a balanced read to write ratio, require high read performance, and desire to use indexes for performance improvements. Examples of document database are MongoDB, DynamoDB, and Couchbase.
[Wait, didn’t we have Dynamo as a key-value database? Yes, we did. Dynamo is one of those multi-model databases I mentioned at the start. So, you can find the same NoSQL database being touted as a solution for more than one scenario. This is why confusion exists in the market over time, which is one of the reasons I am writing this post.]
Graph
Graph databases are perfect for scenarios where you need to create and navigate between nodes, or points of data, and return details about the relationship between those nodes. Examples of graph databases are Neo4j, Cosmos DB, and Amazon Neptune. In one use case, Walmart uses Neo4j to power their recommendation engine.
Additional Factors to Consider
OK, now we know a bit more about the types of databases, as well as how they function and what problems they can solve. And we looked at the CAP theorem as one way to help evaluate which database engine will meet your requirements. But the reality is that there are many additional factors to consider when selecting a database engine. I’ve mentioned a few already, but lets’ build a list and add some commentary.
ACID vs BASE
Relational database engines do ACID very well, whereas NoSQL engines use BASE to achieve the performance offered through horizontal scaling, high availability, and fault tolerance. This is why I stated earlier that relational systems tend to favor writes, and NoSQL systems tend to favor reads.
Availability
Your database engine must have the ability to be highly available. However, that may come at a high price, too. For example, clustering or adding nodes will also increase hardware and software costs. Make certain you understand the topology of the HA architecture and all associated costs.
Disaster Recovery
You will want to review your disaster recovery plans to make certain your database engine meets your requirements and defined SLAs. If it takes too long to backup and restore, you may not want that engine. Or, if backups aren’t an option, and you need to restore or recreate a bunch of flat files, you’ll want to know that now before disaster strikes. Oh, and before I forget, here is your reminder that HA != DR.
Performance
One of the most nebulous statements out there today: “that database doesn’t perform well”. You must know the workload ratios of reads to writes. You need to know if your database engine will maximize one or the other (or both). Also, you should understand that performance may be a result of a tradeoff somewhere else. Rocket-powered transaction throughput sounds great until you understand potential data loss may happen. If you have a need of strong consistency, then you may not get the throughput promised.
Complementary technologies
Chances are you will be looking to use some other piece of technology, such as Spark, Kafka, or RedisCache. You should consider if your engine is robust enough to connect with these other platforms as needed, or if you will need to spend the time and money to build the hooks yourself.
ETL
Speaking of complementary technologies, chances are you will need to be moving data into, and out of, that database. If you already have a preferred ETL provider, make certain it will work with your choice of database engine.
Infrastructure options
The database engine may require you to provide the infrastructure necessary for operations. Or it could be part of a larger data platform offering by a major cloud provider. Another item to consider here is one of data migrations. Spend the time evaluating the effort required to move your data as needed.
Vendors
You may already have a preferred infrastructure vendor, such as Microsoft or AWS. If so, you can leverage that existing relationship to lower costs. Otherwise, you may find that the time, money, and effort required to bring on new hires with the necessary skills for your chosen database engine to be prohibitive.
Monitoring and Alerting
Whatever database you choose you will want to make certain you can monitor, alert, and act upon the alerts as necessary. You don’t want to decide upon a database solution that also requires you to roll your own monitoring solution. Or, maybe you do, but you should factor that as an additional expense, and possibly needing to bring in some expertise.
Business support
At the end of the day, if you don’t have a champion for your choice of database engine on the business side of the fence, then your project won’t go anywhere. If you’ve read this far, then take this advice: Get a champion first, before you invest too much of your time, effort, blood, sweat, and tears into selecting the perfect database(s).
Summary
The complex systems of today require multiple database engines to provide efficient solutions that a single engine of yesterday cannot. The major database vendors are adding cross-functionality support (relational and non-relational features) to their products in response to the needs of their customers. As a result, many of today’s database engines appear to be similar to one another, and it can be difficult to understand what the term “database” even means.
It is my hope that you can use this article as a guide to navigate the world of relational versus NoSQL database engines.
References
https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
https://www.allthingsdistributed.com/2018/06/purpose-built-databases-in-aws.html
7 thoughts on “Databases 101”