Franz Buchsbaum
Imagine you have built a brand-new web application that is finally getting traction. But as customers keep streaming onto your platform, the site begins to buckle. Nothing to worry about - you are prepared: the app has load balancers, is containerized, and the architecture is built to scale. A no-brainer!
But guess what? The database cannot keep up with the content your users are creating and consuming, and scaling it will probably mean some downtime. This is a tale as old as the internet - your code scales, but your database doesn't. How can we solve this problem?
First things first
Scaling horizontally (or "scaling out") means adding more instances of an application or service to share the load. Conversely, scaling vertically (or "scaling up") is about adding more resources, like CPU power or memory, to an instance. With that understood, let's proceed.
Why are databases hard to scale?
You can endlessly scale your web application code horizontally on any major cloud platform or even in your in-house setup. Why can't you do the same for your database?
Probably the most critical reason is state. HTTP, the protocol underlying most web apps and APIs, is stateless, which makes them easy to scale independently. But no major application can live without state. You want to know which users are logged in, what they have ordered from your website, and your users might want to read the article reviews left by other users.
This information must be stored and made available to the entire application, irrespective of the number of its instances. The information should also be up-to-date and in sync. This means we will put it into a database. The good news is, you now have a single source of truth. The bad news: you now have a single point of failure.
Coming back to the initial question: Why are databases hard to scale? Because you cannot simply throw more instances at the problem. Those instances need to stay in sync, and for that, they need to talk to each other constantly. Transactions are just harder to handle in a distributed system.
Let's talk about the most prominent solutions to these problems.
The classical approach: Replication
People soon realized that for most applications, the database information is read more often than written. For the business side, the write requests may be more important than the read requests.
Based on this observation, the notion of a "read-replica" was introduced. In this approach, one single instance is responsible for all write operations (often called the primary or main database), and one or more instances are solely responsible for handling reads. If traffic is routed accordingly, you can have more resources dedicated to the most frequent requests. Congratulations, you can now scale your reads horizontally!
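To make this concrete, here is a minimal Python sketch of an application-level read/write split, assuming a MySQL primary and a single read replica reachable via the pymysql driver. The hostnames, credentials, and table names are purely illustrative assumptions, not part of any particular setup.

```python
# Minimal sketch of an application-level read/write split. Hostnames,
# credentials, and table names are illustrative assumptions.
import pymysql

def connect(host):
    return pymysql.connect(host=host, user="app", password="secret",
                           database="shop", autocommit=True)

primary = connect("db-primary.internal")    # all writes go here
replica = connect("db-replica-1.internal")  # reads can go here

def create_order(user_id, article_id):
    # Writes must always hit the primary.
    with primary.cursor() as cur:
        cur.execute("INSERT INTO orders (user_id, article_id) VALUES (%s, %s)",
                    (user_id, article_id))

def list_reviews(article_id):
    # Reads can be served by a replica and scaled out horizontally.
    with replica.cursor() as cur:
        cur.execute("SELECT author, body FROM reviews WHERE article_id = %s",
                    (article_id,))
        return cur.fetchall()
```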
While this is a simple approach, not too different from your typical caching mechanisms, it also introduces new complexities:
- The main instance must keep all the replicas up-to-date. If the replication process fails or is too slow, you might end up showing some of your users outdated data (a small application-side workaround is sketched after this list).
- Your application needs to know where to send its write and read requests. If this wasn't accounted for when the application was written, large parts of its code may need to be reworked for this strategy to work.
- Your main database is still a single point of failure. If it goes down or is overloaded, you will have a bad time.
- There might still be overhead in scaling. How does your application know when a new replica is available, or that a scaled-down one is gone? What if the replica you are talking to fails and goes down?
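As for the lag issue from the first bullet, a common application-side band-aid is to serve a user's reads from the primary for a short while after they have written something, so they are guaranteed to see their own changes. A rough Python sketch, where the 2-second grace period and the in-memory bookkeeping are pure assumptions for illustration:

```python
import time

# Rough sketch of a "read your own writes" workaround for replication lag.
# The 2-second grace period and the in-memory bookkeeping are assumptions.
_last_write = {}  # user_id -> timestamp of that user's last write

def record_write(user_id):
    _last_write[user_id] = time.monotonic()

def pick_connection(user_id, primary, replica):
    """Return the connection a read for this user should go to."""
    # Shortly after a user wrote something, read from the primary so they
    # are guaranteed to see their own change; otherwise use a replica.
    if time.monotonic() - _last_write.get(user_id, 0.0) < 2.0:
        return primary
    return replica
```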
Some databases now offer the option of running several main instances that share writes and keep them in sync. But most of these are either dedicated cloud offerings or third-party products and therefore take extra work to set up. Now, let's see what we can do about replica failover.
It all balances out
No matter how many instances of your web application are serving users right now, you want them to be available at the same address. This is usually handled by a reverse proxy, a server that listens on your address and forwards requests to an appropriate instance of your app. Sometimes, that proxy will also have decision logic to better distribute connections across the available application instances and act as a load balancer.
This sounds great, but can we use it for our read replicas? It depends. Most reverse proxies handle HTTP requests, but most databases do not speak HTTP. We will have to go deeper: software like Nginx and HAProxy can also forward traffic at the TCP level, which is good enough for most databases.
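To illustrate what such TCP-level forwarding boils down to, here is a minimal Python sketch: accept a client connection, pick a reachable replica in round-robin fashion, and copy bytes in both directions. The hostnames, ports, and the complete lack of real health checking are assumptions for illustration; Nginx and HAProxy do all of this (and much more) in production-grade form.

```python
import itertools
import socket
import threading

# Backends and listen address are illustrative placeholders.
REPLICAS = [("replica-1.internal", 3306), ("replica-2.internal", 3306)]
LISTEN_ADDR = ("0.0.0.0", 6033)

_rotation = itertools.cycle(REPLICAS)

def pick_backend():
    """Round-robin over the replicas, skipping unreachable ones."""
    for _ in range(len(REPLICAS)):
        host, port = next(_rotation)
        try:
            sock = socket.create_connection((host, port), timeout=1)
            sock.settimeout(None)
            return sock
        except OSError:
            continue  # trivial failover: try the next replica
    raise RuntimeError("no replica is reachable")

def pipe(src, dst):
    """Copy bytes in one direction until the connection closes."""
    try:
        while chunk := src.recv(4096):
            dst.sendall(chunk)
    finally:
        dst.close()

def serve():
    with socket.create_server(LISTEN_ADDR) as listener:
        while True:
            client, _ = listener.accept()
            backend = pick_backend()
            # One thread per direction keeps the sketch short and readable.
            threading.Thread(target=pipe, args=(client, backend), daemon=True).start()
            threading.Thread(target=pipe, args=(backend, client), daemon=True).start()

if __name__ == "__main__":
    serve()
```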
On a basic level, this allows us to distribute calls to our read replicas without the application needing to account for it. These proxies can also do some trivial failover based on conditions like whether a server is reachable at the network level. You can add or remove instances without making any changes to your application, and you can even leverage connection pooling. We still need to split read from write calls on our own, though, so the complexity at the code level remains. What can we do about it?
Are you aware?
Imagine for a moment that your proxy is protocol-aware and understands the protocol your database is speaking. What could we do with this? As it turns out: A lot!
If your proxy can understand how the connections work on a protocol level, it can determine if a read replica is out of sync or unable to react, even when the network connection is working. It can route based on how high the load is on the replica and not just on the number of connections.
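As a rough illustration of the kind of check this enables, the snippet below asks a MySQL replica how far it lags behind its source, something a protocol-aware proxy can do continuously in order to stop routing to a replica that falls too far behind. The pymysql driver, the monitoring credentials, the lag threshold, and the (version-dependent) statement and column names are assumptions.

```python
import pymysql

MAX_LAG_SECONDS = 5  # arbitrary threshold for this illustration

def replica_is_healthy(host):
    """Ask a MySQL replica how far it lags behind its source."""
    conn = pymysql.connect(host=host, user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")  # "SHOW REPLICA STATUS" on newer MySQL
            status = cur.fetchone()
            if not status:
                return False  # not replicating at all
            lag = status.get("Seconds_Behind_Master")
            return lag is not None and lag <= MAX_LAG_SECONDS
    finally:
        conn.close()
```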
Suppose your proxy can also understand the queries sent to your database. In that case, it can automatically (or rather, according to your discretion and configuration) determine which ones go to your main instance and which ones can be handled by one of your read replicas. No more having to take care of this in your application; it could all work transparently and out of the box.
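At its core, such routing is a rule applied to the statement itself. The deliberately naive Python sketch below shows the idea; real proxies use configurable rules and handle many more edge cases (transactions, locking reads, stored procedures, and so on).

```python
# Deliberately naive sketch of query-based routing, the kind of decision a
# protocol-aware proxy can make for you. Real products use configurable
# rules and handle many more edge cases.
READ_ONLY_PREFIXES = ("select", "show", "explain")

def route(query):
    normalized = query.lstrip().lower()
    if normalized.startswith(READ_ONLY_PREFIXES) and "for update" not in normalized:
        return "replica"
    return "primary"

assert route("SELECT * FROM reviews WHERE article_id = 7") == "replica"
assert route("UPDATE orders SET status = 'paid' WHERE id = 42") == "primary"
assert route("SELECT balance FROM accounts WHERE id = 1 FOR UPDATE") == "primary"
```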
We can do even fancier stuff:
- Did a query time out because it ran on a replica under high load that could not answer in time? Re-run it, maybe on another replica, and only then report back to the application.
- Or do we know that queries from one specific user take longer than our global timeout of 3 seconds? Let's allow a longer timeout for that user.
- Or maybe even just for that one specific query? A query that comes in every other second and produces the same result? Let's cache it (a small caching sketch follows after this list).
- Having all your queries go through a proxy might be an excellent opportunity to sort out which queries, users or database schemas have the most significant impact on your application.
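To show how simple the caching idea from the list above can be in principle, here is a minimal Python sketch of query-result caching with a time-to-live. The TTL value and the execute() callback are assumptions for illustration only.

```python
import time

# Minimal sketch of query-result caching in front of the database, the kind
# of feature some protocol-aware proxies offer. TTL and callback are assumptions.
_cache = {}  # query text -> (expiry timestamp, cached result)

def cached_query(query, execute, ttl=2.0):
    """Return a cached result for `query`, or run `execute` and cache it."""
    now = time.monotonic()
    hit = _cache.get(query)
    if hit and hit[0] > now:
        return hit[1]
    result = execute(query)  # fall through to the real database
    _cache[query] = (now + ttl, result)
    return result
```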
Some proxies can do a few or all of these things. MySQL (and MySQL-compatible) servers have an open-source project called ProxySQL. MySQL itself provides a tool known as MySQL Router. MariaDB has its own tool called MaxScale. If you want dedicated business support, go for Heimdall Data Proxy. Even AWS is now slowly jumping on the train with its RDS Proxy offering.
Aside: "Just put it in a container"
As we discussed, databases are all about state. The typical container platform is not. It is supposed to be stateless, and each container should be replaceable.
There is some progress in persistent volumes and claims, but inherent difficulties remain unless you choose a database where each instance is valued equally, as in multi-main or sharded setups (Elasticsearch, MongoDB). It will probably be fine if one of your read replicas is restarting because containers are being shifted around the cluster. If it happens to your main database, chances are you won't be too happy. Even though there are some promising projects (e.g., CrunchyData's postgres-operator or PressLabs' mysql-operator), you might want to hold off on running your relational database inside your Kubernetes cluster for now.
Containerized databases can already have their place in non-production environments, though. For development environments, integration, and automated tests, it is quite helpful to have a freshly spun-up database container reset to its initial state every time!
Conclusion
Going forward, an increasing number of ever-better protocol-aware database proxies and other helpful tools can make scaling your database less cumbersome and keep it transparent to your codebase.
Various projects at Nagarro have proven this approach to be a viable solution when your database is the bottleneck and optimizing your queries or building better indices has not provided the desired results.
Want to get more out of your database but don't know where to start? Need support with a specific solution? Reach out to us and get solutions customized to your individual setup!