To prepare for projected growth, we are currently trying to figure out which parts of our system will break at what scale, and how. One step towards this was to define strict timeouts for our database queries, and to eliminate or fix bad queries in the process.
Our requirements were:
- Be able to define different timeout values for different types of servers (app servers, analytics etc.)
- The different limits should be well represented in the code so that they’re easy to discover, even by people who join our team in the future
- It should be easy and quick to modify these limits
We identified multiple sources of our queries. Each of these might need a different query timeout. These sources are:
- App servers: queries that run for our frontend facing APIs, like APIs that our Android app or clients use
- Celery servers: queries made by our celery tasks that run asynchronously
- Cron servers: queries made as part of crons
- Alerts: we have a system that runs SQL queries at configured time intervals, and pushes the data (results of the queries) to relevant people (over Slack)
- Analytics: queries that run as a part of our ETL (v0.1) system
We planned to reduce the timeouts incrementally, because at every iteration there would be queries that could not finish within the planned timeout, and we would have to fix all of those before reducing the timeout further. We defined incremental limits for each iteration.
Our backend is built using Django, and to accomplish this we would have to:
- Write a raw SQL migration to create the roles (if needed)
- Alter the roles to set the appropriate timeout
- Set the database dictionary differently for each type of server in the Django settings, with the correct role and password
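A rough sketch of what this migration-based route could involve. The role names, timeout values, database name, and both helper functions are hypothetical and exist only for illustration; the SQL relies on Postgres's `ALTER ROLE ... SET statement_timeout`, and in a real project the generated statements would be handed to a `migrations.RunSQL` operation.

```python
# Hypothetical roles and timeouts (in milliseconds); not our real values.
ROLE_TIMEOUTS_MS = {
    "app_server": 5000,
    "celery_server": 30000,
    "analytics": 600000,
}


def role_timeout_statements(role_timeouts_ms):
    """Build the raw SQL a migration would run, one statement per role.

    In a Django migration these strings would be passed to
    migrations.RunSQL(...). Creating the roles themselves is left out
    here for brevity.
    """
    return [
        f'ALTER ROLE "{role}" SET statement_timeout = {int(timeout_ms)};'
        for role, timeout_ms in role_timeouts_ms.items()
    ]


def build_databases(server_type, password):
    """Return a DATABASES setting that logs in as the role for this server type."""
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "mydb",       # assumed database name
            "USER": server_type,  # one Postgres role per server type
            "PASSWORD": password,
            "HOST": "localhost",
        }
    }
```

Every change to a limit would mean editing `ROLE_TIMEOUTS_MS`, writing a new migration, and applying it.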
Why not just directly log in to the shell and do this? Because then this change isn’t represented in the code and creates gaps in knowledge over time. But, even though migrations are part of the code, they are just for change management, and rarely does someone go back to migrations to look for “logic” affecting your app’s behaviour.
Since we were planning on having multiple iterations, and there would be a lot of back and forth between the timeout limits while we are experimenting, it would become a hassle to write migrations and apply them every time something had to be changed. This solution was not for us.
We started thinking of a better way to accomplish what we had in mind.
We knew the “ease of configuration” would only come if we could set the timeouts from within Django somehow. Thinking more in this direction, and connecting the tidbits we knew about Django and Postgres, we realized that:
- One can set a timeout using `SET` inside a Postgres session, and it is then adhered to until the end of the current session.
- Django publishes a `connection_created` signal every time a new database connection is created. This connection is then put in the connection pool, from where it can be reused (governed by Django's connection configuration parameters).
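Put together, the two observations suggest a handler along these lines. This is a minimal sketch rather than our exact code: `STATEMENT_TIMEOUT_MS`, `set_statement_timeout`, and the `register` helper are illustrative names, and the handler is assumed to be connected once at startup (for example from an `AppConfig.ready()` method).

```python
STATEMENT_TIMEOUT_MS = 5000  # illustrative value, in milliseconds


def set_statement_timeout(sender, connection, **kwargs):
    """Receiver for Django's connection_created signal.

    Runs once per new database connection and sets the timeout for the
    rest of that Postgres session.
    """
    if connection.vendor != "postgresql":
        return  # statement_timeout is a Postgres setting
    with connection.cursor() as cursor:
        # int() makes sure only a number is interpolated into the SQL.
        cursor.execute(f"SET statement_timeout TO {int(STATEMENT_TIMEOUT_MS)}")


def register():
    """Connect the receiver; call this once, e.g. from AppConfig.ready()."""
    from django.db.backends.signals import connection_created

    connection_created.connect(set_statement_timeout)
```

Because the `SET` runs when the connection is created, every query that later reuses that connection inherits the same limit.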
Aha! Can’t we just catch the connection as soon as it is created, and set the timeout to whatever we desire for the session? Yes we can 🙂
We also went ahead and set the timeout separately for each `connection.alias`. That gives us even more flexibility: we can now set multiple separate timeout values for connections from the same server as well. For example, we can set a timeout of 5s for queries made by our Android app facing APIs, except for the login API, for which we set the timeout to 1s, and then use `<queryset>.using('<alias>')` to pick the timeout we want.
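As a sketch of the per-alias idea, assuming two aliases that point at the same database: the alias name `login`, the database name, and the `TIMEOUTS_MS_BY_ALIAS` mapping are hypothetical, while the 5s and 1s values come from the example above. The handler would be connected to Django's `connection_created` signal (from `django.db.backends.signals`).

```python
import copy

BASE_DB = {
    "ENGINE": "django.db.backends.postgresql",
    "NAME": "mydb",  # assumed database name
}

# Both aliases point at the same database; only the timeout differs.
DATABASES = {
    "default": copy.deepcopy(BASE_DB),
    "login": copy.deepcopy(BASE_DB),
}

# Timeouts in milliseconds, keyed by connection alias.
TIMEOUTS_MS_BY_ALIAS = {
    "default": 5000,  # Android app facing APIs
    "login": 1000,    # login API
}


def set_statement_timeout(sender, connection, **kwargs):
    """connection_created receiver that picks the timeout by alias."""
    timeout_ms = TIMEOUTS_MS_BY_ALIAS.get(connection.alias)
    if timeout_ms is None:
        return  # no limit configured for this alias
    with connection.cursor() as cursor:
        cursor.execute(f"SET statement_timeout TO {int(timeout_ms)}")
```

A query would then opt in to the stricter limit with something like `User.objects.using("login").get(pk=user_id)`, where `User` stands in for any model.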
The benefits of this approach are:
- Everything is in the code, you can just read and figure out what is happening
- Easy to modify the timeouts
- Since the logic is now in the application code, we can do more stuff with this, like setting a certain timeout only for some percentage of the connections.
- Further, we can set different timeouts for different queries made from the same server
This approach also has some caveats:
- Since we are running an additional query every time a connection is made, it has some implications. Even though Django’s documentation says that the effect is minor, it is worth checking whether it’s okay for your case
- Since the timeout is set using Django’s signals, it will not work wherever Django does not publish a signal. One such case is when you log in to the Postgres shell directly (or by running `python manage.py dbshell`).
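The idea of applying a timeout to only some percentage of connections, mentioned among the benefits above, could be sketched like this. The names, the 10% fraction, and the 2s limit are all hypothetical; the `rng` parameter is there only so the random choice can be controlled in tests.

```python
import random

TIMEOUT_MS = 2000        # stricter limit being trialled (illustrative)
ROLLOUT_FRACTION = 0.10  # apply it to roughly 10% of new connections


def should_apply(fraction, rng=random.random):
    """Return True for roughly `fraction` of calls."""
    return rng() < fraction


def set_statement_timeout(sender, connection, rng=random.random, **kwargs):
    """connection_created receiver that limits only a sample of connections."""
    if not should_apply(ROLLOUT_FRACTION, rng):
        return  # this connection keeps the database default
    with connection.cursor() as cursor:
        cursor.execute(f"SET statement_timeout TO {int(TIMEOUT_MS)}")
```

Note that the sampling happens per connection, not per query: a connection that gets the limit keeps it for as long as it is reused.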
The changes mentioned here give us more control over our queries, and we can selectively restrict our systems in case there is a 🔥 that needs strict action to keep the more important parts of the app alive (not the best solution, but sometimes it doesn’t have to be 😃).
If you’re a talented developer who believes she’ll be a good fit for Squad — we’re hiring!
(Originally posted on my Medium account)