Hack Django ORM to calculate Median and Percentiles (Or make annotations great again!)
July 08, 2017 • programming
Hacks should be quick. And so should be the articles about them.
Photo by Caspar Rubin on Unsplash
Problem
We needed to calculate medians, percentiles for some quantities for our own ETL system (note to self: write a post on this) grouped by certain fields.
Some options we had:
- Extra: But Django says this will be deprecated, and use it as a last resort. We still had resorts to explore.
- Raw SQL: Besides all usual bad things with writing RAW SQL (look at the number of warnings on the linked page!), the code starts to look ugly.
So what was the best way to do it?
Come on! Django also gives us something called a RawSQL. Great. So we can just use it to get the percentiles we wanted. Right?
Wrong. As we realised later, RawSQL is better suited for aggregates and not annotations. Exhibit:
q = MyModel.objects.values('some_fk_id').annotate( | |
avg_duration=Avg('duration'), | |
perc_90_duration=RawSQL('percentile_disc(%s) WITHIN GROUP (ORDER BY duration)', (0.9,)), | |
) | |
print q.query | |
# SELECT "some_fk_id", | |
# AVG("duration") AS "avg_duration", | |
# (percentile_disc(0.9) WITHIN | |
# GROUP ( | |
# ORDER BY duration)) AS "perc_90_duration" | |
# FROM "mymodel" | |
# GROUP BY "some_fk_id", (percentile_disc(0.9) WITHIN | |
# GROUP ( | |
# ORDER BY duration)) |
Notice how our Raw expression percentile_disc(0.9) WITHIN GROUP (ORDER BY duration)
also gets added to the GROUP BY
clause?
This would not happen if we remove the Avg("duration")
from annotation. So basically, if the query already has a GROUP BY
clause, RawSQL
will add the sql
to the GROUP BY
clause as well.
This is not what we want. It also didn’t make sense to us, why is that needed? Maybe when we want to use RawSQL
in an order_by
and want the expression to get added to GROUP BY
automatically? Maybe.
Solution
We dug deep as to why is the sql added to GROUP BY
. Looked at the source code, found this method get_group_by_cols
which returns self
. Super sensible naming by Django devs. I knew we could do something here. Ergo, the Hack:
class RawAnnotation(RawSQL): | |
""" | |
RawSQL also aggregates the SQL to the `group by` clause which defeats the purpose of adding it to an Annotation. | |
""" | |
def get_group_by_cols(self): | |
return [] | |
# The Query | |
q = MyModel.objects.values('some_fk_id').annotate( | |
avg_duration=Avg('duration'), | |
perc_90_duration=RawAnnotation('percentile_disc(%s) WITHIN GROUP (ORDER BY duration)', (0.9,)), | |
) | |
print q.query | |
# SELECT "some_fk_id", | |
# AVG("duration") AS "avg_duration", | |
# (percentile_disc(0.9) WITHIN | |
# GROUP ( | |
# ORDER BY duration)) AS "perc_90_duration" | |
# FROM "mymodel" | |
# GROUP BY "some_fk_id" |
We created a class RawAnnotation
and overrode get_group_by_cols
to return an empty array. And now it works as expected.
Yay.
(Originally posted on my Medium account)