Two years with Celery in Production: Bug Fix Edition
December 17, 2017 • programming
Photo by Martin Oslic on Unsplash
Update 6th Mar, 2022: My friend and colleague Ayush Shanker recently published a follow-up post for most of the problems mentioned in this article. I recommend going through that post as well (or maybe just that post): "Celery in production: Three more years of fixing bugs".
As mentioned in an earlier post, we rely on Celery for publishing and consuming tasks to/from our RabbitMQ (RMQ) broker. We are very happy with the whole setup and it works reliably for us. But this wasn't exactly the case up until four weeks back. We were plagued with a plethora of issues which we hadn't gotten around to fixing, mostly because the number of issues was small and they occurred infrequently.
But as the frequency and the types of issues increased, we started spending a lot of time on maintenance. I personally would have to keep an eye on the RMQ admin to see which queues weren't being consumed properly, and check whether workers (we will use "workers" for celery workers or consumers) were still up and not stuck in an infinite restart loop. And every time I noticed something abnormal, I would have to restart the rogue worker(s) manually.
The Issues
Here is a list of the issues that we identified and fixed:
- Worker servers always had an inexplicably high RAM usage
- Worker servers always had an inexplicably high CPU usage
- Workers stayed idle, not consuming any tasks
- Workers kept on restarting
- Publishers did not adhere to task_publish_retry_policy
Let's go over them one by one.
Worker servers always had an inexplicably high RAM usage
Tl;dr: We had orphaned child processes still running. Use stopasgroup if you are using supervisord.
UPDATE: A month after writing this, the problem is still happening, albeit rarely and less severely. A final fallback is to regularly (via cron, maybe) kill the leftover processes. This is also hinted at in Celery's documentation.
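If you do go the cron route, here is a rough sketch of what such a cleanup could look like; this is an assumption on our part rather than something prescribed by Celery, it relies on orphaned children being re-parented to PID 1 (under a subreaper, e.g. some systemd setups, the parent PID may differ), and the process-name match may need adjusting for how celery titles its processes on your machines:
# rough sketch: kill celery worker processes that have been re-parented to init (PID 1)
ps -eo pid,ppid,cmd | awk '$2 == 1 && /celery worker/ {print $1}' | xargs -r kill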
We noticed that our worker servers always had ~100% RAM usage, and that processes for old workers were still alive. This became apparent because we had very recently changed the configuration and we could see processes for workers with the old configuration (they were listening to nonexistent queues). This was a red flag.
We use supervisor to control the celery workers and have to do a reread every time we change the worker config. On searching, we found out that when we reread and update, or when we restart, supervisor doesn't kill the old processes. This was also raised in an issue on supervisor. As suggested, we started doing stop and start instead of restart (in hindsight this looks dumb now, since all restart does is stop and start), and before rereads as well. Of course, this didn't help things.
A ps auxf output (after removing unneeded info) from one of our servers clearly showed how there were old celery processes still hanging out.
USER     START   TIME   COMMAND
ubuntu   03:38   0:57   \_ celery worker A
ubuntu   03:38   0:30   |   \_ celery worker A
ubuntu   03:38   0:32   |   \_ celery worker A
ubuntu   03:38   0:32   |   \_ celery worker A
ubuntu   03:38   0:31   |   \_ celery worker A
ubuntu   03:38   0:58   \_ celery worker B
ubuntu   03:38   0:31       \_ celery worker B
ubuntu   03:38   0:31       \_ celery worker B
ubuntu   03:38   0:31       \_ celery worker B
ubuntu   03:38   0:30       \_ celery worker B
ubuntu   Sep09   1:59   celery worker C
ubuntu   Sep10   3:04   celery worker D
It was clear because:
- We had restarted all the celery workers quite recently, as is visible in the START column, while the two older processes are at least a day old.
- How celery roughly works is that we start a parent process, which starts more child processes (depending on the concurrency) and maintains a pool of these workers (see the sketch below). This is consistent with how the newer workers are depicted, while the two older ones have no parent/child processes.
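To make that parent/child layout concrete, here is a minimal sketch (the app module and concurrency value are placeholders, not our actual configuration):
> celery -A proj worker --concurrency=4
This starts one parent celery process plus four pool (child) processes, which is exactly the one-parent-with-children shape visible for workers A and B above, and which is missing for the two older processes.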
On further research, I came across an issue in celery that mentioned the same problem.
Turns out, celery parent processes don't propagate the STOP signal to their child processes, leaving them orphaned (these are the old workers we saw in the ps output above). This got fixed by using stopasgroup in the supervisord config. As documented, with stopasgroup set to true, supervisor sends the STOP signal to the whole process group.
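For reference, a minimal sketch of the relevant supervisord program section; the program name, command, and queue are placeholders, not our actual config:
[program:celery_worker_A]
command=celery -A proj worker -Q queue_a --concurrency=4
; send the stop signal to the whole process group so children are not orphaned
stopasgroup=true
; also send SIGKILL to the whole group if the processes ignore the stop signal
killasgroup=true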
Worker servers always had an inexplicably high CPU usage
Tl;dr: Our --max-memory-per-child was set too low; we moved back to using --max-tasks-per-child.
We would have been okay with this if the tasks were CPU intensive or if the rate or volume of task consumption were high. They were not: the tasks were simple gets and sets on the database, and the rate was about 17 tasks per second with ~12 workers spread over 4 servers.
To figure out what was going on, I sshed into a server with ~100% CPU usage and ran:
> ps aux --sort=-%cpu
I found a worker process that was taking ~48% of the CPU. To check what was happening with that worker, I looked at the logs it generated. The logs showed that the worker was being killed after every 3–4 tasks because it hit the memory limit (we were starting our workers with a --max-memory-per-child of some 200–300mb). Although that looked like a sane value to us, maybe we misunderstood how this limit works.
The logs were similar on other servers. This would explain the high CPU usage: killing and creating processes takes CPU. It would also explain the slow rate of execution. So, on a hunch, I removed this limit from the workers and moved back to using --max-tasks-per-child.
Instantly the rate increased to ~250 tasks per second (from 17) and the CPU usage also settled down. Huge win. Memory leaks are still covered because of the limit on the number of tasks.
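For illustration, a minimal sketch of a worker invocation after the change (the app module, queue, and numbers are placeholders, not our actual values):
> celery -A proj worker -Q default --concurrency=4 --max-tasks-per-child=100
Worth noting for anyone tuning the other flag: --max-memory-per-child is specified in kilobytes of resident memory, which is an easy unit to trip over.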
Workers stayed idle, not consuming any tasks
Tl;dr: There was a deadlock because of an interplay between psycopg2 and ssl. Updating psycopg2 to 2.6.1 fixed the issue.
The queue had available tasks and healthy consumers, but the tasks weren't being consumed. This would happen every 2–3 hours for some particular workers. We had to manually restart the hanging workers, which quickly became a headache.
One hypothesis was that a particular task was making the workers hang, since the RMQ admin showed picked-but-unacked tasks whenever the workers hung. Checking the logs for the hung workers confirmed that the workers had started a particular task but not completed it. I double-checked this by looking at the active tasks for the hung workers:
> celery -A squadrun inspect active
The solution for this, as suggested on the internet, was to periodically restart the workers. But that looked like a hack to me, not a solution. We weren't making any further headway, though. So, on a hunch again, we decided to use strace on a hung worker to see what it was actually stuck on. This turned out to be a good idea:
> sudo strace -p 27067
Process 27067 attached
futex(0x33372e8, FUTEX_WAIT_PRIVATE, 2, NULL ^C
Process 27067 detached
On searching for what futex and FUTEX_WAIT_PRIVATE mean, and checking similar issues with celery, we stumbled upon an issue very similar to ours.
From one of the comments on the issue:
Are you using Postgres in any way? Because that’s exactly what was happening with us. Postgres registers a locking callback with ssl, which all ssl requests use. Unfortunately Postgres unloads the callback after closing its connections, which is fine for Postgres, but any other ssl consumer (requests included) would be unable to release the lock, allowing a thread to deadlock itself if it tries to reacquire its own lock.
As mentioned, updating to psycopg2==2.6.1 (we were at 2.5.2) fixed the issue.
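The fix itself is just a dependency bump; assuming a pip-based setup like ours, something along the lines of:
> pip install 'psycopg2==2.6.1'
and pinning the same version in requirements.txt so it doesn't regress.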
Workers kept on restarting
Tl;dr: celery==4.0 had a bug, which got fixed in v4.1.0. But we went ahead with our own fork and applied the patch to it.
Sometimes, our workers would keep restarting continually. This was a problem on our staging environment though, and, when nothing else worked, emptying the RMQ queue once and then re-publishing the tasks fixed the restarting worker.
To figure out what was going wrong, we again went back to the logs, and saw the following exception and traceback:
2017-09-09 16:48:44,188 [CRITICAL] celery.worker: Unrecoverable error: TypeError("'NoneType' object is not callable",)
Traceback (most recent call last):
File "celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
... ...
File "billiard/pool.py", line 1487, in apply_async
self._quick_put((TASK, (result._job, None, func, args, kwds)))
TypeError: 'NoneType' object is not callable
This one was the easiest problem to figure out. It looks like this bug makes an appearance when there are tasks already in the queue and a worker is consuming from multiple queues: the workers start consuming from a queue even before the queue is ready. This bug was fixed in celery==4.1.0; we were at v4.0. Before making the update though, while checking that there were no regressions or new bugs introduced in v4.1.0, we came across one major one. That bug made us less confident about updating, because v4.1.0 had come out fairly recently, and testing it thoroughly to make sure nothing else was broken would have taken a long time.
So, we forked celery at v4.0, applied the patch that fixes the bug we were facing, and we were good to go!
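Installing such a fork is straightforward with pip; a hypothetical example (the repository URL and branch name below are made up, not our actual fork):
> pip install git+https://github.com/yourorg/celery.git@v4.0-patched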
Publishers did not adhere to “task publish retry policy”
Tl;dr: task_publish_retry_policy is broken in kombu==4.1.0; we downgraded to 4.0.2.
Just when we thought we were done with bugs and issues caused by celery and other related libraries, another one made an appearance. Somewhere amongst all this, we had also updated kombu to v4.1.0.
Now, for one of our APIs, we wanted the celery publisher to raise an exception as soon as a connection couldn't be established while publishing. But that just wouldn't happen. If a connection wasn't established, the process would go into sleep(1) (as we found out from the traceback when we killed the process manually) and keep doing this indefinitely. We tried setting a stricter retry policy explicitly while publishing the task using apply_async, but it didn't work out.
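For context, this is roughly the kind of per-call retry policy we were trying to set; the task name and the values are illustrative placeholders, not our actual ones:
# illustrative only: process_data is some @app.task, payload its argument
process_data.apply_async(
    args=[payload],
    retry=True,
    retry_policy={
        'max_retries': 3,      # give up after 3 attempts
        'interval_start': 0,   # retry immediately the first time
        'interval_step': 0.2,  # add 0.2s between successive retries
        'interval_max': 0.5,   # never wait more than 0.5s between retries
    },
)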
Finally, we found out that task_publish_retry_policy was broken in kombu==4.1.0. We downgraded to 4.0.2 and things started working as expected.
Takeaways:
- Logs made it easy to debug issues. Long live logs.
- Test library updates before applying them to production, because great developers make mistakes too :D
A big thank you to Tarun Garg, my friend and colleague, for helping with the fixes :)
(Originally posted on my Medium account)