Write better SQL: In defense of group by 1
Oct 03, 2019
I love group by 1. And every so often you'll see me use group by 1, 2. My love for this little snippet of code developed in the summer of 2016. I was a fledgling data analyst, and I was writing BI queries on Redshift.
Here were my two options:
select
date_trunc('month', order_created_at) as order_month,
count(order_id) as n_orders,
sum(revenue) as total_revenue
from core.fct_orders
group by date_trunc('month', order_created_at)
not_group_by_1.sql
select
date_trunc('month', order_created_at) as order_month,
count(order_id) as n_orders,
sum(revenue) as total_revenue
from core.fct_orders

group by 1
group_by_1.sql
Already, we can start to see the problem of not using group by 1: I have to repeat my code.
What if I wanted to convert the timezone? Here are my two options again:
select
date_trunc(
'month',
convert_timezone('Australia/Sydney', order_created_at)
) as order_month,
count(order_id) as n_orders,
sum(revenue) as total_revenue
from core.fct_orders
group by date_trunc('month', convert_timezone('Australia/Sydney', order_created_at))
not_group_by_1_even_more_complicated.sql
select
date_trunc(
'month',
convert_timezone('Australia/Sydney', order_created_at)
) as order_month,
count(order_id) as n_orders,
sum(revenue) as total_revenue
from core.fct_orders
group by 1
still_just_group_by_1.sql
When you're intent on writing out your group by columns, as the complexity of that first column increases, it gets harder to maintain and read the group by statement.

In my opinion, the second query is cleaner and easier to read. If I want to know what grain my report is at, it's easy for me to parse the 1, jump up a couple of lines of SQL and realize it's at the grain of order_month.
Plus, as someone who writes a lot of SQL, I very much appreciate the fact that 1 is exactly one character.
In late 2018, Redshift introduced lateral column aliasing, meaning I could instead group by order_month. Surely I'd switch to that!
-- post August 2018 option, thanks to lateral column aliasing
select
date_trunc(
'month',
convert_timezone('Australia/Sydney', order_created_at)
) as order_month,
count(order_id) as n_orders,
sum(revenue) as total_revenue
from core.fct_orders
group by order_month
group_by_order_month.sql
I admit, that's pretty darn readable! And readability trumps DRY code. And yet, I still continue to group by 1. In fact, I have become even more ardent in my love of it. So much so that I may or may not own this keytag:

So why do I still group by 1? Well, there are a few reasons:
- Not every warehouse that I regularly use supports lateral column aliasing (I'm looking at you, BigQuery and Postgres 👀).
- If you happen to alias a column with the name of an existing column, your SQL engine will assume you are referring to the existing column in your group by clause (it's a rare case, but it happens from time to time when you write enough SQL; see the sketch below).
- 1 is a single character (and I'm lazy, OK?!).
- I believe group by 1 helps you to write better SQL.
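To illustrate that alias-collision point, here's a minimal sketch against the same hypothetical core.fct_orders table used above. If the truncated date is aliased with the same name as the source column, grouping by that name resolves to the un-truncated source column, while group by 1 unambiguously points at the select expression.
select
    date_trunc('month', order_created_at) as order_created_at,  -- alias shadows the source column
    count(order_id) as n_orders
from core.fct_orders
-- most engines resolve this name to the source column, not the alias,
-- so the result is grouped by the raw timestamp rather than by month
group by order_created_at
-- using group by 1 (or repeating the date_trunc expression) removes the ambiguity
alias_collision.sql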
Let's dive into that last reason a little more.
In my opinion, every reason against group by 1 literally doesn't matter if you write good SQL. In fact, using group by 1 can even help you recognize when your SQL has room for improvement. So let's go through some of the most common objections to group by 1. But first, let's change up our example to something that highlights this point a little better.
These days, as an analytics engineer, it's much less common that I'm writing a query to aggregate to a reporting period --- I let my BI tool do that aggregation for me. Instead, I am often writing queries to build dbt models, where I group by a foreign key to calculate some facts about an entity. Here's an example:
select
order_id,
count(payment_id) as n_payments,
sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
sum(amount) as total_amount
from staging.payments
group by 1
order_payments.sql
Let's use this as the basis of our examples for the rest of the article as we wade through the most common objections to group by 1.
"But then people write group by 3, 4, 7
, which is awful." --- a luddite
Agreed! That is awful. But in my opinion, the columns you are grouping by should always be listed first in your query.
In this case, group by 3, 4, 7 would help you recognize that you haven't followed this convention. If you had instead grouped by a column name, you might not realize that the order of your columns can be improved.
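To make that concrete, here's a rough sketch using the same hypothetical staging.payments columns as the example above, comparing a query where the grouped columns are buried mid-select with one where they lead:
-- grouped columns buried in the middle: group by 2, 4 is a smell
select
    count(payment_id) as n_payments,
    order_id,
    sum(amount) as total_amount,
    payment_method
from staging.payments
group by 2, 4

-- grouped columns listed first: the grain is obvious, and the group by reads cleanly
select
    order_id,
    payment_method,
    count(payment_id) as n_payments,
    sum(amount) as total_amount
from staging.payments
group by 1, 2
column_order_comparison.sql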
"But how will I know the grain of my model? 1
doesn't tell me that!" --- probably a flat-earther
I always use a line break to visually break up the columns I'm grouping by, and the aggregates (see the examples above). So my brain uses that to understand the grain.
Further, I also tend to put aggregating queries in a CTE, and use the CTE names to also call out the grain of the model, like so:
with payments as (
select * from staging.payments
),
-- this model is at the grain of one per order
-- you can tell, because it starts with `order_`
order_payments as (
select
order_id,
count(payment_id) as n_payments,
sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
sum(amount) as total_amount
from payments
group by 1
)
select * from order_payments
order_payments.sql
Side note: This "all logic in a CTE" pattern is one we use a lot --- it's especially helpful in queries with lots of CTEs, as it helps us chain CTEs together, and is quicker to debug. If you're not yet using it, give it a go and you'll see what we mean.
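For instance, while building the order_payments model above, you can temporarily point the final select at an upstream CTE to inspect its output, then switch it back when you're done. A small sketch of the habit (not part of the finished model):
-- while debugging, swap the final select to an earlier CTE...
-- select * from payments

-- ...then restore it before committing
select * from order_payments
debugging_ctes.sql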
"But group by 1
ends up in group by 1, 2, 3, 4, 5, 6, 7
, which is hard to read!" --- actually my mortal enemy
I mean, is it really any easier to read group by order_id, customer_id, order_date, status, ...?

But why is this happening in the first place? Why is a query grouping by so many columns? In my experience, there are two reasons why people group by more than one or two columns:
- The query is both joining and aggregating
- The query is using a group by to deduplicate
And neither of them is actually a very good pattern!
1. The query is both joining and aggregating
Consider this SQL, where some fields on fct_orders are being calculated from payments:
with orders as (
select * from staging.orders
),
payments as (
select * from staging.payments
),
joined as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
count(payments.payment_id) as n_payments,
sum(case when payments.payment_method = 'credit_card' then payments.amount else 0 end) as credit_card_amount,
sum(payments.amount) as total_amount
from orders
left join payments using (order_id)
group by 1, 2, 3, 4
)
select * from joined
fct_orders.sql
I mean it looks tidy enough! But, it's hard to extend. What happens if you want to re-order the columns and put your dates at the end? Suddenly, you're breaking the convention of having the columns you want to group by listed first. What happens when you have more columns, or want to join another table to this? It gets really hard to manage the complexity.
Here, there's actually a better pattern --- group first in a CTE, then join it to your table.
with orders as (
select * from staging.orders
),
payments as (
select * from staging.payments
),
order_payments as (
select
order_id,
count(payment_id) as n_payments,
sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
sum(amount) as total_amount
from payments
group by 1
),
joined as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
order_payments.n_payments,
order_payments.credit_card_amount,
order_payments.total_amount
from orders
left join order_payments using (order_id)
)
select * from joined
fct_orders.sql
It's a little longer, but much more maintainable, and in my opinion, easier to understand what's going on.
(Bonus: this pattern is also far more performant!)
(Bonus bonus: if you're using dbt, consider breaking this CTE out into a separate model so you can test that order_payments only has one record per order!)
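If you're not using dbt's built-in tests, a rough sketch of the same grain check in plain SQL (assuming order_payments has been materialized as its own model) might look like:
-- should return zero rows if order_payments really is one record per order
select
    order_id,
    count(*) as n_records
from order_payments
group by 1
having count(*) > 1
assert_one_record_per_order.sql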
2. The query is using a group by to deduplicate
Consider this group by that's being used to deduplicate source data to be one record per order_id.
with orders as (
select * from staging.orders
),
deduplicated as (
select
order_id,
customer_id,
status,
max(order_date) as order_date
from orders
group by 1, 2, 3
)
select * from deduplicated
stg_orders.sql
OK, sure, that's not so bad. But what if your table has a dozen columns? That's a lot of columns to group by. And what if the status changed values? You'd end up with two records for the same order_id, so you haven't effectively deduplicated!
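To make that failure mode concrete, here's a sketch with a couple of made-up rows for a single order whose status changed:
with orders as (
    -- hypothetical source rows: one order, two status values
    select 1 as order_id, 42 as customer_id, 'placed' as status, cast('2019-09-01' as date) as order_date
    union all
    select 1, 42, 'shipped', cast('2019-09-03' as date)
),
deduplicated as (
    select
        order_id,
        customer_id,
        status,
        max(order_date) as order_date
    from orders
    group by 1, 2, 3
)
-- returns two rows for order_id = 1: the "deduplication" silently failed
select * from deduplicated
dedup_gone_wrong.sql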
Generally, group by to deduplicate is a pretty limited pattern. Here are two other patterns we prefer --- heads up: they both use window functions.
Option 1: Create a row_number and use a where

This row_number pattern is really useful for wide tables as it extends for extra columns really well!
with orders as (
select * from staging.orders
),
orders_with_row_number as (
select
*,
row_number() over (
partition by order_id
order by order_date desc
) as row_number
from orders
),
deduplicated as (
select * from orders_with_row_number
where row_number = 1
)
select * from deduplicated
stg_orders.sql
Option 2: Use distinct with window functions

This pattern is a little tricky, but it can come in really handy for narrow tables, like creating a two-column table to map one user identifier to another (the example below is still for the orders table though). It's one I only learned recently!
with orders as (
select * from staging.orders
),
deduplicated as (
select distinct
order_id,
last_value(customer_id) over (
partition by order_id
order by order_date
rows between unbounded preceding and unbounded following
) as customer_id,
max(order_date) over (
partition by order_id
) as order_date
from orders
)
select * from deduplicated
stg_orders.sql
We deviated a little from talking about group by here, but it's important to note that in my opinion, the concern about grouping by too many columns is more a symptom of a poorly structured query than a problem with the group by 1 syntax. As a rough guess, I'd say that 95% of my group by statements are group by 1, and most of the rest are group by 1, 2. It's very rare that I group by more than that.
Final thoughts
OK, I've intentionally written this to ruffle a few feathers, and I'm fully aware that some SQL dialects don't support group by 1 (e.g. mssql).
Opinions can change --- for example, I used to be very pro-UPPER-CASE-FOR-SQL-KEYWORDS, and even pro-leading commas.
In writing all of this out, I think I've even softened my opinion on group by column_name a little (heck, maybe it is better? When the from and group by are next to each other, maybe you can parse it a little more quickly. After all, brain time is the most expensive resource of all! Though, group by 1 is fewer characters...)
At the end of the day, be opinionated in your SQL style, put it in a git repo and stick with it consistently. Here's ours, and here's another one I love --- yes the latter recommends group by column_name and I still love it.
In all honesty, I can't wait for the day that someone builds a really good SQL linter based on a definable set of conventions, so that no matter which conventions you adopt, it's the job of a computer (and not me) to rant about breaking them.
{% endrant %}