Skip to content

Commit 9b69c08

Browse files
authored
docs: explain deferred usage of aggregations and sorting (#853)
docs: explain deferred usage of aggregations and sorting
2 parents ca216b3 + b427eab commit 9b69c08

File tree

1 file changed

+42
-1
lines changed

1 file changed

+42
-1
lines changed

docs/using-gitbase/optimize-queries.md

+42-1
Original file line numberDiff line numberDiff line change
@@ -328,4 +328,45 @@ This will pretty-print the analyzed tree of your query. If you see a node named
328328

329329
- `commit_files.file_path = files.file_path`
330330
- `commit_files.tree_hash = files.tree_hash`
331-
- `commit_files.blob_hash = files.blob_hash`
331+
- `commit_files.blob_hash = files.blob_hash`
332+
333+
## GROUP BY and ORDER BY memory optimization
334+
335+
The way GROUP BY and ORDER BY are implemented, they hold all the rows their child node will return in memory and once all of them are present, the grouping/sort is computed.
336+
In order to optimise a query having an ORDER BY or GROUP BY is important to perform those operations as late as possible and with the least amount of data possible. Otherwise, they can have a very big impact on memory usage and performance.
337+
338+
For example, consider the following query:
339+
340+
```sql
341+
SELECT LANGUAGE(f.file_path) as lang, SUM(ARRAY_LENGTH(SPLIT(f.blob_content, "\n"))-1) as lines
342+
FROM ref_commits rc
343+
NATURAL JOIN commits c
344+
NATURAL JOIN commit_files cf
345+
NATURAL JOIN files f
346+
WHERE rc.ref_name = 'HEAD'
347+
AND f.file_path NOT REGEXP '^vendor.*'
348+
AND NOT IS_BINARY(f.blob_content)
349+
GROUP BY lang
350+
```
351+
352+
This query returns the total number of lines of code per language in all files in the HEAD reference of all repositories. What happens here is that grouping will be done with a row that contains `blob_content`. This means a lot of data will be kept in memory to perform this aggregation. That could lead to tens of gigabytes of RAM usage if there are a lot of repositories in the dataset.
353+
354+
Instead, the following query returns exactly the same rows, but only outputs what's necessary in a subquery, keeping way less data in memory.
355+
356+
```sql
357+
SELECT lang, SUM(lines) AS lines
358+
FROM (
359+
SELECT LANGUAGE(f.file_path, f.blob_content) as lang,
360+
(ARRAY_LENGTH(SPLIT(f.blob_content, "\n"))-1) as lines
361+
FROM ref_commits rc
362+
NATURAL JOIN commits c
363+
NATURAL JOIN commit_files cf
364+
NATURAL JOIN files f
365+
WHERE rc.ref_name = 'HEAD'
366+
AND cf.file_path NOT REGEXP '^vendor.*'
367+
AND NOT IS_BINARY(f.blob_content)
368+
) t
369+
GROUP BY lang
370+
```
371+
372+
As a good rule of thumb: defer as much as possible GROUP BY and ORDER BY operations and only perform them with the minimum amount of data needed.

0 commit comments

Comments
 (0)