One of the things that is particularly fantastic about relational databases is the way it allows you to craft queries and aggregate the data in a useful way.
By adding a GROUP BY
clause, we transform the query we’re running from something that fetches pure data to something that aggregates it. There’s a few performance gotchas here, a couple of neat tricks and an issue that consistently occurs
that shows you haven’t modelled your query right.
Any query in a GROUP BY is still limited by the performance constraints you’d expect from a query without it. If we had a table that looked like this:
CREATE TABLE `People` (
`id` bigint NOT NULL,
`name` varchar(255) NOT NULL,
`age` bigint NOT NULL,
`country` varchar(255) NOT NULL
)
CREATE INDEX idx_country ON People (country);
SELECT age, count(*) FROM People WHERE country='United Kingdom' GROUP BY age;
It has the same underlying performance characteristics as:
SELECT age FROM People WHERE country='United Kingdom';
Differences in the time taken for this data to return would purely be about the data transfer between the client and the database - in both cases, the database is having to pull the same amount of data off disk.
One neat thing about aggregations is that for specific queries, you can pull all the data purely off the index without having to touch the actual table. If you instead had an index that looked like:
CREATE INDEX idx_country_age ON People (country, age);
There’s enough data purely on ths index to solve the query asked for it.
This is largely mitigated now, as MySQL complains about it when you write queries that do this incorrectly - but you still get people confused about it and turning off the protection against it.
Let’s say we have this table:
id | name | age |
---|---|---|
1 | Jordan | 20 |
2 | Amy | 30 |
3 | James | 30 |
Then we run this query, what’s the expected response?
SELECT age, name FROM People GROUP BY age
…the answer is an error - because we’re asking a question of the database which we can’t deterministically answer - it could hypothetically return Jordan, 20 happily - but it can’t know whether to return Amy or James as the value for age 30.
Up until MySQL 5.7, MySQL would just return a piece of data, not caring about which piece of data it will return. This led to a bunch of complicated queries being made without anyone caring about the fact there was no consistency.
If you truly don’t care about which bit of data you want when you’re doing this, you can run:
SELECT age, min(name) FROM People GROUP BY age
This makes the query deterministic, so it’ll be happy again - but bare in mind that in this example it will then have to read all the names in order to work out which one is “min”.
Managing to aggregate a bunch of different data from a dataset can be very slow. Let’s say you’ve got a million events from the last month and you want to gather some aggregated statistics about it. Here’s an example schema:
CREATE TABLE `Events` (
`id` bigint NOT NULL,
`event` varchar(255) NOT NULL,
`state` varchar(255) NOT NULL,
`timestamp` bigint NOT NULL
)
Let’s say you have 1000 rows of data and you want to know a bunch of different data on it. For this example, we’ll have something that looks