Relational Databases - A Deep Dive
Most of us have worked with relational databases, as they are arguably the most popular type of database today. For each application, we create multiple tables, each with multiple columns, and some tables can grow to a huge number of rows. We also often establish multiple connections so our applications can fetch information from the database concurrently. Some of us may also work with batch applications that mostly run long-running queries.
Have you ever wondered how the database manages to fetch data from multiple tables with multiple connections, often executing in parallel, while also efficiently handling insert, update, and delete operations simultaneously?
In this article, we will see how the database organizes data on disk so that it can be fetched efficiently later. We'll also look at how it uses an indexing mechanism for primary keys to quickly retrieve the data we're searching for from a massive pile of records. Additionally, we'll dive into how secondary indexes work: how the database stores these indexes and maps data to them when we create them as users.
To understand these characteristics, let’s consider the user table below. Imagine this table contains over 100,000 records. Our discussion will be based on this example.
How Databases Efficiently Organize Data on Disk
What happens when we read a file from disk? The system grabs a chunk of sequential data from the file and begins processing it. Each read operation incurs significant overhead, such as seeking to the appropriate location on the disk and transferring the data into memory. As a result, the system retrieves larger chunks of data in order to reduce the number of disk I/O operations and improve overall efficiency.
To align with this strategy, databases arrange data in fixed-size blocks called pages. A page is typically 4 KB, but it can be larger depending on the specific database configuration (PostgreSQL defaults to 8 KB pages and MySQL's InnoDB to 16 KB, for example). When you look for data, the database reads the entire page from disk into memory.
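As a minimal sketch of this idea, assuming fixed-size rows and an illustrative page size, we can compute which page holds a given row and where the row sits inside that page (the sizes here are made up for illustration, not taken from any particular database):

```python
PAGE_SIZE = 4096   # bytes per page (illustrative default)
ROW_SIZE = 128     # illustrative fixed row size in bytes

ROWS_PER_PAGE = PAGE_SIZE // ROW_SIZE   # 32 rows fit in one page

def page_for_row(row_number: int) -> int:
    """Return the page that contains the given zero-based row number."""
    return row_number // ROWS_PER_PAGE

def offset_in_page(row_number: int) -> int:
    """Byte offset of the row inside its page."""
    return (row_number % ROWS_PER_PAGE) * ROW_SIZE

# Reading row 100 means fetching page 3 in a single I/O,
# then slicing the row out of the in-memory copy of the page.
print(page_for_row(100), offset_in_page(100))   # 3 512
```

Real databases use variable-length rows and slot directories inside pages, but the read-a-whole-page-per-I/O pattern is the same.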
All the rows from the above table are stored as a contiguous array of bytes in pages like the one below. This is a simplified representation. In reality, databases store data in a binary format within pages, where text, numbers, images, and other types are represented as sequences of bytes.
A database stores data either in heap-organized tables or index-organized tables. In heap-organized tables, data is not stored in any specific order; it is placed into pages as it is inserted. Index-organized tables, on the other hand, store data in a structured order based on the primary key. We will discuss index-organized tables in more detail later in the article.
How Databases Locate Data Quickly
As we saw above, raw data is stored in a collection of pages. When we run the query "SELECT * FROM user", the database begins fetching data from the start of the page where the first record of the user table is stored, continuing until it reaches the last record. Some clients (such as Oracle SQL Developer) apply a predefined fetch limit, such as 500 records, and the database fetches rows only up to that limit.
But what if we want to query records by the primary key (user_id)? For example, with a filter such as "SELECT * FROM user WHERE user_id = 50001", the database cannot afford to simply scan from the beginning of the table. Given the substantial volume of reads and writes databases handle, doing this for every query would be slow and wasteful of disk resources.
Therefore, the database needs an indexing mechanism that lets it find data quickly. Most databases maintain a default index, essentially a map from primary keys to the actual locations of the data within pages. We will discuss how indexes are created for secondary keys later.
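To make the "kind of a map" idea concrete, here is a toy sketch where a plain dictionary maps each primary key to a (page, slot) location. Real databases use a B-tree for this map, and all the names and numbers here are hypothetical:

```python
index = {}   # user_id -> (page number, slot within page)
pages = {}   # page number -> {slot: row}

def insert(user_id, row, page, slot):
    """Store the row in its page and record its location in the index."""
    pages.setdefault(page, {})[slot] = row
    index[user_id] = (page, slot)

def lookup(user_id):
    """Jump straight to the row's location instead of scanning every page."""
    page, slot = index[user_id]   # O(1) in a dict; O(log n) in a B-tree
    return pages[page][slot]      # one page read instead of a full table scan

insert(50001, {"name": "Alice"}, page=3, slot=7)
print(lookup(50001))   # {'name': 'Alice'}
```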
To build the index structure, we need a data structure that handles insert, update, and search operations quickly. To evaluate candidates, we measure how a data structure performs as the number of records in the table grows. This measurement is called time complexity: it predicts how a data structure or algorithm will behave as the size of the data it handles increases.
Big O notation is the most commonly used notation for expressing time complexity. It focuses on the worst-case scenario, specifying the maximum time an algorithm can take to complete. For a data set of 1 billion records, here are the step counts corresponding to various Big O notations.
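A quick back-of-the-envelope calculation shows why the difference matters at this scale: a logarithmic algorithm needs only about 30 steps for a billion records, while a linear one needs a billion.

```python
import math

n = 1_000_000_000   # 1 billion records

# Approximate step counts for common time complexities at this size:
steps = {
    "O(1)":       1,
    "O(log n)":   math.ceil(math.log2(n)),    # ~30 steps
    "O(n)":       n,                          # a full scan
    "O(n log n)": n * math.ceil(math.log2(n)),
}

for name, count in steps.items():
    print(f"{name:>10}: {count:,} steps")
```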
From the above diagram, it is clear that we should aim for a time complexity of either O(1) or O(log n), because anything higher can degrade performance.
If we consider common data structures such as arrays, linked lists, stacks, and queues, at least one of the search, insert, or update operations has a time complexity greater than O(log n).
A data structure that comes close to meeting our requirement is the binary search tree. It is not used as the indexing mechanism in databases, but discussing it is important here because its ideas are close to the data structure databases actually implement.
Binary Search Tree
First, let’s explore what a binary tree is. A binary tree is a data structure in which each node has at most two children, referred to as the left and right child.
Here, the ‘child 1’ and ‘child 2’ nodes are also called leaf nodes because they have no children of their own.
A binary search tree (BST) is a binary tree that satisfies the following conditions for any given node:
All nodes in the left subtree have values less than the node's value.
All nodes in the right subtree have values greater than the node's value.
In the above binary search tree, for every node, the value of the left child is less than that of the parent node, and the value of the right child is greater.
Now, let’s see how search, insert, and delete work in a binary search tree.
To search for the value ‘8‘ in the above tree, we begin at the root node.
Since the searched value 8 < 10 (the root), we move to its left child.
Then, as 8 > 6, we proceed to its right child.
In the third step, we find 8.
If we need to insert 7, we follow the same three steps, and at the fourth step we insert 7 as the left child of 8.
Deletion in a binary search tree is similar to insertion if the deleted node is a leaf node. When deleting a node with one child, we can simply reattach the child to the deleted node's parent. But the process is a bit more complex if we need to remove a node that has two children (in this case 10). The node’s value can be replaced with the smallest value in its right subtree (13) or the largest value in its left subtree (8).
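The three operations above can be sketched in plain Python. This is a textbook implementation (deletion replaces a two-child node with the smallest value of its right subtree), not how a database stores anything, and the values mirror the example tree:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def search(node, value):
    """Walk down the tree, halving the search space at each step."""
    while node is not None:
        if value == node.value:
            return node
        node = node.left if value < node.value else node.right
    return None

def insert(node, value):
    """Descend as in search, then attach the new value as a leaf."""
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = insert(node.left, value)
    elif value > node.value:
        node.right = insert(node.right, value)
    return node

def delete(node, value):
    """Remove a value; a two-child node takes the smallest value
    of its right subtree (its in-order successor)."""
    if node is None:
        return None
    if value < node.value:
        node.left = delete(node.left, value)
    elif value > node.value:
        node.right = delete(node.right, value)
    else:
        if node.left is None:
            return node.right
        if node.right is None:
            return node.left
        successor = node.right
        while successor.left is not None:
            successor = successor.left
        node.value = successor.value
        node.right = delete(node.right, successor.value)
    return node

root = None
for v in [10, 6, 14, 4, 8, 13]:
    root = insert(root, v)
root = insert(root, 7)    # 7 becomes the left child of 8
root = delete(root, 10)   # 10 is replaced by 13, its successor
```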
As you saw in the above examples, every operation traverses either to the left subtree or to the right subtree at each step, cutting the remaining elements to search in half. This is what lets search, insert, and update run in logarithmic time (O(log n)) in a binary search tree.
But there is an issue: what if the binary search tree becomes skewed? For example, if we insert the elements 5, 4, 3, 2, 1 in the same order, the tree will be skewed to the left.
For the above skewed binary search tree, the time complexity of insert, update, and search operations degrades to O(n). The tree has effectively turned into a linked list, so finding an element such as 1 requires traversing all the way from the root to the leaf.
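We can verify the degeneration with a small self-contained sketch: inserting 5, 4, 3, 2, 1 in sorted order sends every new node left, so the tree's height equals the number of elements.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(node, value):
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node

def height(node):
    """Number of nodes on the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

root = None
for v in [5, 4, 3, 2, 1]:   # sorted input: every new node goes left
    root = insert(root, v)

print(height(root))   # 5 — one level per element, so search is O(n)
```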
To address this limitation, we have self-balancing binary search trees such as AVL trees and red-black trees, which are closely related to plain binary search trees but prevent the skewed formations shown above. These trees guarantee that the height remains balanced, keeping the time complexity of all operations at O(log n) even in the worst case.
Memory vs Disk data structures
There is another challenge in working with a “binary“ tree on disk.
Implementations of binary search trees can be very effective when used in memory, since they cut the search space in half at every step. But their efficiency decreases significantly when used with disk storage.
To access a specific node, the system needs to locate the block on the disk where the node is stored. On traditional HDDs, this involves moving the read/write head and rotating the platter to the correct position, which is a costly operation. Even after paying this cost, the disk reads only the single value stored in that node, and based on that value it must seek again to the left or right subtree, further adding to the inefficiency.
Hence, we need a data structure that has the complexity guarantees of a binary search tree and is also highly performant for disk operations, capable of reading a large number of values in a single disk seek or I/O operation.
B-Trees (Balanced Trees)
B-trees come to the rescue: they are the data structure most databases use for indexing. Each node in a database B-tree is typically 4 KB in size, although it can be larger depending on the configuration. A B-tree can accommodate a large amount of data in a single node, referred to as a page. When a node is accessed, all the data in the page is read from disk into memory in a single I/O operation. This significantly reduces the number of disk accesses compared to a binary tree.
In the above B-tree, to look up the key 50001, you start at the root node. You find the pair of keys such that the left key is less than or equal to 50001 and the right key is greater than 50001. Once you identify this key range, you follow the corresponding pointer to the appropriate child node. You then repeat this process, navigating down the tree until you reach the leaf node that contains the key 50001, and read the corresponding data record from the page.
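The descent described above can be sketched as follows. This is a hypothetical two-level tree with made-up keys, using a binary search within each node's sorted key list to pick the child whose range contains the search key:

```python
from bisect import bisect_right

class BTreeNode:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys           # sorted separator keys
        self.children = children   # child nodes (internal node only)
        self.records = records     # key -> row data (leaf node only)

def search(node, key):
    """Descend from the root to a leaf; one page read per level."""
    while node.children is not None:
        # bisect_right picks the child whose key range contains `key`.
        node = node.children[bisect_right(node.keys, key)]
    return node.records.get(key)

leaf1 = BTreeNode([], records={10001: "row-A", 30002: "row-B"})
leaf2 = BTreeNode([], records={50001: "row-C", 70003: "row-D"})
root = BTreeNode([50001], children=[leaf1, leaf2])

print(search(root, 50001))   # row-C
```

A real B-tree has far higher fan-out (hundreds of keys per 4 KB page), which is exactly what keeps the tree shallow and the number of disk reads small.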
The same steps apply to inserts and updates. You find the leaf node based on the key, then insert or update the value in the corresponding page: the database loads the whole page into memory, modifies the records in that page, and writes the updated page back to disk.
The database maintains the index structure in fixed-size pages (usually 4 KB). What if an insert or update causes a page to exceed its configured size limit? In that case, the database splits the page into two separate pages and redistributes the data between the two newly created pages.
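A page split can be sketched as halving a sorted list of (key, row) entries, with the first key of the new right page pushed up to the parent as a separator (the capacity of 4 entries here is illustrative):

```python
def split_page(page, max_keys=4):
    """Split an overfull page of sorted (key, row) entries in two.
    Returns (left page, separator key for the parent, right page)."""
    if len(page) <= max_keys:
        return page, None, None   # still fits: no split needed
    mid = len(page) // 2
    left, right = page[:mid], page[mid:]
    separator = right[0][0]       # first key of the right page
    return left, separator, right

page = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
left, sep, right = split_page(page)
print(left, sep, right)
# [(1, 'a'), (2, 'b')] 3 [(3, 'c'), (4, 'd'), (5, 'e')]
```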
Write-ahead log (WAL)
As we saw above, during inserts and updates the database loads the entire page into memory, updates it, and writes it back to disk. Additionally, splitting a page involves updating the parent with the newly created pages. If the database crashes during these operations, it may be left in an inconsistent state. To ensure data integrity, databases maintain an additional structure called a write-ahead log (WAL). Every change, whether an insert, update, or delete, is first written to the write-ahead log before the B-tree structure is updated.
In the event of a database crash or failure during an update operation, the write-ahead log can be used to recover and restore the database to a consistent state.
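The log-first discipline and crash recovery can be sketched as follows. A Python list stands in for the append-only log file, and the page store is a plain dict; both are hypothetical stand-ins for on-disk structures:

```python
import json

wal = []     # append-only log (stands in for the on-disk WAL file)
pages = {}   # stands in for the B-tree pages on disk

def apply(op):
    """Apply a logged operation to the page store."""
    if op["type"] in ("insert", "update"):
        pages[op["key"]] = op["value"]
    elif op["type"] == "delete":
        pages.pop(op["key"], None)

def write(op):
    wal.append(json.dumps(op))   # 1. durably log the change first
    apply(op)                    # 2. only then modify the pages

def recover():
    """After a crash, replay the log to rebuild a consistent state."""
    pages.clear()
    for line in wal:
        apply(json.loads(line))

write({"type": "insert", "key": 50001, "value": "Alice"})
write({"type": "update", "key": 50001, "value": "Alice B."})
pages.clear()    # simulate a crash that loses the page updates
recover()
print(pages[50001])   # Alice B.
```

Real WALs also record checkpoints so recovery does not replay the whole log, but the ordering guarantee — log first, pages second — is the core idea.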
The write-ahead log is also used to replicate changes to read replicas or standby servers. This helps distribute read queries across multiple instances and aids in disaster recovery.
Index-Organized Tables
The approach we have seen so far, storing the data in pages separate from the index, is known as a heap-organized (or page-organized) table. There is another way to organize data: store the rows in the leaf nodes themselves, as in the diagram below, instead of in separate pages. This is called an index-organized table, where the primary keys and the associated data are stored together in the same B-tree structure.
A primary key index in which key and data are stored together is called a clustered index. In some databases, such as MySQL's InnoDB engine, a clustered index is created by default.
Another efficient approach is the B+ tree structure used in index-organized tables. In a B+ tree, all leaf nodes are linked, as in the diagram below, forming a linked list at the leaf level. Index-organized tables built on a B+ tree can be highly efficient, especially for range queries: once the database reaches the leaf node containing the start of the range, it can follow the linked list up to the end of the range without navigating back and forth between parent and leaf nodes.
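A range query over linked leaves can be sketched like this, assuming we have already descended to the leaf containing the start of the range (the leaf contents are illustrative):

```python
class Leaf:
    def __init__(self, rows):
        self.rows = rows   # sorted list of (key, data) pairs
        self.next = None   # link to the next leaf in key order

def range_query(start_leaf, lo, hi):
    """Scan linked leaves for keys in [lo, hi] without revisiting parents."""
    results = []
    leaf = start_leaf
    while leaf is not None:
        for key, data in leaf.rows:
            if key > hi:
                return results   # past the end of the range: stop
            if key >= lo:
                results.append((key, data))
        leaf = leaf.next         # follow the leaf-level linked list
    return results

a = Leaf([(1, "a"), (2, "b")])
b = Leaf([(3, "c"), (4, "d")])
c = Leaf([(5, "e"), (6, "f")])
a.next, b.next = b, c

print(range_query(a, 2, 5))   # [(2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```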
Secondary Indexes
Until now, we have seen how databases create indexes for primary keys. In the relational model, additional indexes can also be created on columns other than the primary key to improve the performance of read queries. These additional indexes are known as secondary indexes.
Databases create a separate B-tree for each secondary index. In heap-organized tables, the leaf nodes of this B-tree point directly to the pages where the data resides.
In index-organized tables, the leaf nodes instead hold the corresponding primary key values, referencing the B+ tree of the primary index.
Unlike primary keys, which uniquely identify each row in the table, secondary indexes may have multiple records for the same key, since multiple rows can share the same value in the indexed column. As a result, the leaf nodes of a secondary index can contain pointers to multiple pages, or reference multiple primary keys in index-organized tables.
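A sketch of a non-unique secondary index in the index-organized style: each indexed value maps to a list of primary keys, and every hit costs a second lookup through the primary key index (the column and rows are made up):

```python
from collections import defaultdict

rows = {   # stands in for the primary index: primary key -> row
    1: {"name": "Alice", "city": "Austin"},
    2: {"name": "Bob",   "city": "Boston"},
    3: {"name": "Carol", "city": "Austin"},
}

# Secondary index on the non-unique "city" column:
# each value maps to a *list* of primary keys.
city_index = defaultdict(list)
for pk, row in rows.items():
    city_index[row["city"]].append(pk)

def find_by_city(city):
    """Each primary key found here needs a second, primary-index lookup."""
    return [rows[pk] for pk in city_index.get(city, [])]

print(find_by_city("Austin"))   # Alice's and Carol's rows
```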
Creating multiple secondary indexes can improve the read performance of queries involving those columns, but write performance takes a hit, since the database must update every secondary index on each insert or update.
Limitations of the Relational Model
Even though the relational model is mature and has been in use for decades, it has its own limitations.
Each write to a relational database can trigger multiple disk operations, leading to a scenario called write amplification.
For every write, the change is written twice: once to the write-ahead log and once to the actual page on disk. Each page write in turn requires reading the whole page, updating the value, and writing the page back to disk.
If an insert or update causes the page to exceed its fixed size limit (4 KB), the page has to be split into two different pages, requiring additional writes for both new pages.
In some cases, inserting into a leaf node forces changes to propagate to higher-level nodes to maintain the balance of the tree, which means further writes.
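A rough back-of-the-envelope calculation makes the first point concrete. With illustrative sizes (not measurements from any specific database), updating a 100-byte row rewrites a whole 4 KB page plus a log entry:

```python
# Illustrative write amplification for one small row update.
row_bytes = 100
page_bytes = 4096

wal_write = row_bytes    # the change is logged first
page_write = page_bytes  # then the whole page is rewritten

amplification = (wal_write + page_write) / row_bytes
print(f"{amplification:.1f}x bytes written per logical byte")
```

Page splits and parent-node updates only push this figure higher.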
The next issue is that write operations in relational databases require O(log n) time to locate the appropriate page. Together with write amplification, this can make relational databases less efficient for write-heavy applications. For such workloads, the Log-Structured Merge (LSM) model can be more suitable, as write, update, and delete operations happen in constant time (O(1)): they only append data to disk sequentially. The LSM model is a huge topic in itself, with its own design considerations, and needs a separate article altogether.
We have come to the end of our article. We have only scratched the surface of how databases work. Databases also handle numerous other tasks, including concurrency control, transaction management, maintaining logical partitions etc.
Thank you for taking the time to read my article. This is my first one and I’m planning to delve into detailed discussions across various topics in the future. I welcome your feedback and suggestions in the comments section, especially regarding areas where I can improve. Your input will be invaluable as I continue to refine my writing skills and explore new subjects in depth.
References:
Database Internals: A Deep Dive into How Distributed Data Systems Work by Alex Petrov
Designing Data-Intensive Applications by Martin Kleppmann
Fundamentals of Database Engineering (Udemy) by Hussein Nasser