Amazon Aurora - A Deep Dive - Part 1
Amazon Aurora is a powerful relational database service offered as part of Amazon Web Services (AWS). Aurora is compatible with MySQL and PostgreSQL, and existing applications that use MySQL or PostgreSQL can migrate to Aurora without any changes to their code.
Amazon Aurora offers several guarantees:
Performance: Delivers up to 5x the performance of MySQL and 3x the performance of PostgreSQL.
Storage: Automatically scales database storage capacity, reaching up to 128 TB (128,000 GB).
Read Replicas: Supports up to 15 read replicas with sub-10ms replica lag.
Failover: Provides fast, automatic failover. Aurora detects a primary instance failure and promotes a read replica to primary, typically within 30 seconds.
Self-Healing: Continuously scans data blocks and disks for errors and repairs them automatically.
and much more.
In this article, we will explore how Amazon Aurora's architecture addresses the limitations of relational databases in distributed environments and provides the guarantees listed above.
Previously, we discussed the internals of relational databases, focusing on how data is layered on a single disk and how primary and secondary indexes are stored and linked to the actual data records. You can refer to that discussion in the article below.
We will explore in detail how Amazon Aurora enhances the following aspects:
Improvements in Quorum
Segregate Workloads
Segmented Storage
Storage Internals
Replication Mechanism
Handling of Writes, Reads, Commits, and Replicas
In the first part of the article, we will focus on the first three points listed above. We will begin with an introduction to replication and scaling, and how traditional relational databases handle replication.
Replication:
In most of our applications, we use a database to persistently store data. All the data received by the application is saved in the database and can be fetched later whenever needed.
However, what happens if the server hosting our database shuts down, or the disk containing the data becomes corrupted? We could lose all our data, and our whole application is useless without it. This situation is known as a single point of failure.
To avoid this, applications keep copies of the data on a server in a different location. If the main database fails, the application can switch to the backup server. This process is called replication.
Scaling:
As our application grows in popularity and handles increasing amounts of data, the database may struggle to manage the high volume of read and write requests.
We can address this in two ways:
Vertical Scaling: Adding more CPU, RAM, and storage to the existing server.
Horizontal Scaling: Adding more servers to distribute the load.
With vertical scaling, there is a limit to how many resources you can add to a single machine, and if that server fails, our application will be offline until it is back up. This is why many mission-critical applications prefer horizontal scaling.
Single leader replication:
Relational databases are often configured in a master-slave setup: a single master server handles all the write requests (the write instance), and multiple read instances (read replicas) serve the read requests.
If multiple servers handled writes, the database would have to deal with all the complexities and errors associated with enforcing ACID (Atomicity, Consistency, Isolation, Durability) properties across multiple write servers. For example, if a transaction involves data stored on multiple instances and one of those instances fails while committing the transaction, part of the transaction might persist while the other part is discarded, leaving the database in an inconsistent state.
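To make this concrete, here is a minimal sketch of how an application might route traffic in a single-leader setup. It is illustrative Python, not any specific driver's API: the master and replica connection objects (and their execute method) are hypothetical. All writes go to the one master, while reads are spread across the replicas.

import itertools

class SingleLeaderRouter:
    """Send writes to the single master and round-robin reads across replicas."""

    def __init__(self, master, replicas):
        self.master = master                        # the only instance accepting writes
        self.replicas = itertools.cycle(replicas)   # rotate over read replicas

    def execute_write(self, sql, params=()):
        # Every write goes through the master, so ACID properties only
        # need to be enforced on a single node.
        return self.master.execute(sql, params)

    def execute_read(self, sql, params=()):
        # Any replica can serve a read, though it may briefly lag the master.
        return next(self.replicas).execute(sql, params)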
Improvements in Quorum
In this section, we will explore the concept of quorum and how the Aurora team has enhanced traditional quorum mechanisms to make the Aurora database highly available.
Let's start with a basic replication setup: a database with a single master instance and two read replicas. Suppose we have a variable 'x' in our database that is currently set to 2.
When an application 'A' issues a write request to update 'x' to 3, the new value is first persisted in the master instance and then replicated to read replica 1.
However, while the update is being replicated to replica 2, a network issue occurs, and the replication fails. Meanwhile, another application 'B' issues a read request for the same variable 'x' from read replica 2. Due to the replication failure, read replica 2 incorrectly returns the old value of 'x' instead of the new updated value.
The inconsistency here is introduced because the data is replicated across multiple instances. To resolve it, a concept called quorum is used.
A quorum is the minimum number of instances on which a write or read request must complete before it is considered successful.
A write quorum is the minimum number of instances that must acknowledge a write operation before it is considered successful.
A read quorum is the minimum number of instances that must be contacted and return a response for a read operation to be considered valid.
If there are n replicas, the quorum rule requires that the read quorum (r) and the write quorum (w) together exceed n:
r + w > n
For example, with 3 replicas, both writes and reads must complete on at least 2 instances:
2 + 2 > 3
A write operation is considered successful only if the variable 'x' is stored successfully on at least 2 instances. Similarly, a read operation must receive responses from at least 2 instances, which guarantees that at least one of them returns the latest value of 'x'.
In the example described below, both the read and write requests succeed because they satisfy the quorum requirements. Quorum ensures that the database stays consistent and that the application retrieves the most up-to-date data.
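To see the rule in action, here is a small, self-contained sketch (illustrative Python, not Aurora's actual code) of quorum reads and writes. Each write must be acknowledged by at least w replicas and carries a version number; a read contacts r replicas and returns the value with the highest version, so it always overlaps with the latest successful write.

N, W, R = 3, 2, 2   # replicas, write quorum, read quorum (R + W > N)

# Each replica stores (version, value); None simulates an unreachable node.
replicas = [{"version": 0, "value": 2} for _ in range(N)]

def quorum_write(version, value):
    acks = 0
    for rep in replicas:
        if rep is None:                 # replica unreachable, e.g. a network issue
            continue
        rep["version"], rep["value"] = version, value
        acks += 1
    return acks >= W                    # succeeds only with W acknowledgements

def quorum_read():
    responses = [rep for rep in replicas if rep is not None]
    if len(responses) < R:
        raise RuntimeError("read quorum not reached")
    # The freshest version among the R contacted replicas includes the last write.
    return max(responses[:R], key=lambda r: r["version"])["value"]

replicas[2] = None                          # replica 2 drops out during replication
assert quorum_write(version=1, value=3)     # still succeeds: 2 of 3 acknowledged
print(quorum_read())                        # prints 3, never the stale value 2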
But the team that developed Amazon Aurora believed that this standard quorum approach is inadequate in large cloud deployments like AWS. To understand why, let's first look at how AWS divides its data centers. The primary unit is the region: AWS comprises multiple regions, each located in a separate geographical area. Each region contains 3 to 6 Availability Zones (AZs), which are physically separated and isolated data centers, each with its own independent power and networking infrastructure. The AZs within a region are connected by high-speed, low-latency network links.
Let's consider that our database is deployed in the us-east-1 region, with the master instance located in AZ us-east-1a and the replicas in AZ us-east-1b and us-east-1c.
If AZ us-east-1c goes down due to a power issue, our read and write queries will still succeed, since they can be served from AZ us-east-1a and us-east-1b. However, in a large cloud environment like AWS, which maintains hundreds of thousands of servers, multiple servers may be down or undergoing repairs at any given time. In that scenario, if AZ us-east-1c is down and the server hosting our database in either us-east-1a or us-east-1b also fails, read or write queries may not succeed.
Hence, in Aurora, the team chose an "AZ + 1" design: instead of maintaining 3 replicas, Aurora keeps 6 copies of the data, placing 2 copies in each of 3 Availability Zones.
For a transaction to be successful, this model requires a read quorum of 3 out of 6 and a write quorum of 4 out of 6 successful replications, which satisfies the equation r + w > n, where n is the total number of copies. The write quorum of 4 guarantees higher durability for writes, while the read quorum of 3 keeps reads fast without sacrificing consistency.
Therefore, write operations can withstand the loss of any two nodes, including a single AZ failure, while read operations can endure the failure of an entire AZ plus one additional node.
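The fault-tolerance claim can be checked with a little arithmetic. The sketch below is illustrative only; it is not Aurora's membership logic, just the quorum math with 6 copies, a write quorum of 4, and a read quorum of 3.

N, W, R = 6, 4, 3              # Aurora's 6 copies, write quorum 4, read quorum 3

assert R + W > N               # every read overlaps the latest write (3 + 4 > 6)
assert 2 * W > N               # any two writes overlap, so writes have a single order

def writes_available(lost): return N - lost >= W
def reads_available(lost):  return N - lost >= R

# Losing an entire AZ means losing 2 of the 6 copies.
print(writes_available(lost=2))   # True  -> writes survive a full AZ failure
print(reads_available(lost=3))    # True  -> reads survive an AZ failure plus one more node
print(writes_available(lost=3))   # False -> AZ + 1 stops new writes, but data is not lost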
Segregate Workloads
In the previous section, we saw how the Aurora team improved availability by enhancing the quorum mechanism. In this section, we'll discuss how Aurora storage scales automatically with increasing data volumes.
Databases typically handle two types of workloads:
Compute: These are CPU and RAM-intensive tasks, including query processing, transaction management, data security, and encryption.
Storage: These are disk and I/O-intensive tasks, such as storing tables, indexes, BLOBs (binary large objects), and stored procedures. Storage workloads ensure that data is durably stored, backed up, and can be recovered in case of failures.
Consider a scenario where our application experiences a sharp surge in read activity, resulting in a huge volume of read queries hitting our database. To support this, we plan to create another read replica. In a traditional setup, this involves provisioning another server and copying all the data in storage to the new server, a process that can be very time-consuming. Our situation specifically calls for more compute resources, such as CPU and RAM, yet we end up performing the time-consuming task of copying the entire dataset to another replica.
Amazon Aurora separates the compute and storage layers, managing them on different servers. This separation allows each layer to scale independently:
Compute Scaling: If the workload becomes more compute-intensive but doesn't require additional storage, we can quickly allocate more compute resources (CPU and RAM) without scaling up the storage capacity. This means we can add read replicas without duplicating data storage.
Storage Scaling: Aurora automatically scales storage as the data volume increases. This ensures that the storage layer grows as needed without manual intervention.
Moreover, each layer can be optimized independently, for example by tuning resource allocation specifically for compute or for storage.
Additionally, this design improves fault isolation: failures or issues in one layer are less likely to impact the other.
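Because every instance in a cluster attaches to the same shared storage volume, adding compute is just a matter of adding another instance to the cluster. The snippet below is an illustrative example using boto3 (the AWS SDK for Python); the cluster name, instance name, and instance class are placeholders, not values from this article.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Attach one more reader to an existing Aurora cluster. The new instance
# only adds CPU and RAM; it mounts the cluster's shared storage volume,
# so no bulk copy of the data is required.
rds.create_db_instance(
    DBInstanceIdentifier="my-aurora-reader-2",   # placeholder name
    DBClusterIdentifier="my-aurora-cluster",     # placeholder cluster
    DBInstanceClass="db.r6g.large",
    Engine="aurora-mysql",
)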
Segmented Storage
Earlier we discussed that in a large cloud environment like AWS, which maintains hundreds of thousands of servers, multiple servers may be down or undergoing repairs at any given time. To ensure high availability and fault tolerance, we need to either prevent server failures or minimize the repair time.
There are two key metrics in server management for distributed systems:
Mean Time to Failure (MTTF): The average time a system or component operates before it fails.
Mean Time to Repair (MTTR): The average time required to repair a system or component and return it to operational status after a failure.
For a distributed system to work properly, MTTF should be high compared to MTTR. However, for cloud providers like AWS, which maintain a vast number of servers, increasing the MTTF of individual servers is challenging because any server can fail at any time for a variety of reasons. Therefore, the Aurora team focused on reducing MTTR instead.
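A quick way to see why shrinking MTTR matters as much as stretching MTTF: the steady-state availability of a component is roughly MTTF / (MTTF + MTTR). The numbers below are purely illustrative.

def availability(mttf_hours, mttr_hours):
    # Fraction of time the component is expected to be operational.
    return mttf_hours / (mttf_hours + mttr_hours)

# Same hardware (MTTF of roughly one year), very different repair times:
print(availability(mttf_hours=8760, mttr_hours=24))         # ~0.99727   (a day to rebuild)
print(availability(mttf_hours=8760, mttr_hours=10 / 3600))  # ~0.9999997 (a 10-second repair)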
The storage layer of the database is divided into multiple small, fixed-size segments, each 10 GB in size. For example, suppose our database is split into three segments as shown below:
Each 10GB segment is replicated six times to meet the AZ + 1 quorum requirements. These replicas are grouped into Protection Groups (PGs), with each PG consisting of six 10GB segments.
The segments in a PG are distributed across three different Availability Zones (AZs). Each AZ contains two segments from the protection group.
The Aurora storage volume is constructed by concatenating these individual Protection Groups (PGs).
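Under these assumptions (10 GB segments, 6 copies per protection group, 2 copies in each of 3 AZs), the layout of a volume follows directly. The sketch below simply counts protection groups and copies, using the three-segment example above and the 128 TB maximum mentioned earlier.

SEGMENT_GB = 10       # fixed segment size
COPIES_PER_PG = 6     # 6 copies per protection group (AZ + 1 quorum)
AZS = 3               # copies are spread as 2 per Availability Zone

def layout(volume_gb):
    protection_groups = -(-volume_gb // SEGMENT_GB)     # ceiling division
    total_copies = protection_groups * COPIES_PER_PG
    copies_per_az = COPIES_PER_PG // AZS
    return protection_groups, total_copies, copies_per_az

print(layout(30))        # (3, 18, 2)        -> the three-segment example above
print(layout(128_000))   # (12800, 76800, 2) -> a maxed-out 128 TB volume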
Aurora continuously monitors each segment, and because a segment is only 10 GB in size, a failed segment can be quickly repaired by copying the data from a healthy replica.
Each storage instance is connected by a high-bandwidth network link, so replicating a 10 GB segment over a 10 Gbps link can be completed within about 10 seconds. This rapid repair is what enables Aurora's self-healing capability.
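The 10-second figure follows from simple arithmetic, ignoring protocol overhead: 10 GB is 80 gigabits, which takes about 8 seconds to transfer at 10 Gbps.

segment_gb, link_gbps = 10, 10
transfer_seconds = (segment_gb * 8) / link_gbps   # 10 GB = 80 Gb -> 8 s at 10 Gbps
print(transfer_seconds)                           # 8.0, roughly the 10-second window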
According to the Aurora team's analysis, for Amazon Aurora to lose quorum, the following would all have to occur within the same 10-second repair window:
Two segment failures.
The failure of an entire Availability Zone that does not contain either of the segments mentioned in the point above.
Based on AWS's experience maintaining numerous servers, such a scenario is highly unlikely.
Aurora leverages its architecture for planned outages, such as OS and security patching and software upgrades. During these events:
One segment is marked as bad, and updates are executed one AZ at a time.
This ensures that no more than one member of a protection group is being patched simultaneously.
In the second part of the article, we will explore how the Aurora team enhanced the internals of database storage, minimized data amplification during the replication process, and optimized standard operations such as writes, commits, reads, and replication.
References:
https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmod-17.pdf
https://www.udemy.com/course/aws-certified-solutions-architect-associate-saa-c03
AWS re:Invent Aurora videos