You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/using-gitbase/optimize-queries.md
+42-1
Original file line number
Diff line number
Diff line change
@@ -328,4 +328,45 @@ This will pretty-print the analyzed tree of your query. If you see a node named
328
328
329
329
-`commit_files.file_path = files.file_path`
330
330
-`commit_files.tree_hash = files.tree_hash`
331
-
-`commit_files.blob_hash = files.blob_hash`
331
+
-`commit_files.blob_hash = files.blob_hash`
332
+
333
+
## GROUP BY and ORDER BY memory optimization
334
+
335
+
The way GROUP BY and ORDER BY are implemented, they hold all the rows their child node will return in memory and once all of them are present, the grouping/sort is computed.
336
+
In order to optimise a query having an ORDER BY or GROUP BY is important to perform those operations as late as possible and with the least amount of data possible. Otherwise, they can have a very big impact on memory usage and performance.
337
+
338
+
For example, consider the following query:
339
+
340
+
```sql
341
+
SELECT LANGUAGE(f.file_path) as lang, SUM(ARRAY_LENGTH(SPLIT(f.blob_content, "\n"))-1) as lines
342
+
FROM ref_commits rc
343
+
NATURAL JOIN commits c
344
+
NATURAL JOIN commit_files cf
345
+
NATURAL JOIN files f
346
+
WHERErc.ref_name='HEAD'
347
+
ANDf.file_path NOT REGEXP '^vendor.*'
348
+
AND NOT IS_BINARY(f.blob_content)
349
+
GROUP BY lang
350
+
```
351
+
352
+
This query returns the total number of lines of code per language in all files in the HEAD reference of all repositories. What happens here is that grouping will be done with a row that contains `blob_content`. This means a lot of data will be kept in memory to perform this aggregation. That could lead to tens of gigabytes of RAM usage if there are a lot of repositories in the dataset.
353
+
354
+
Instead, the following query returns exactly the same rows, but only outputs what's necessary in a subquery, keeping way less data in memory.
355
+
356
+
```sql
357
+
SELECT lang, SUM(lines) AS lines
358
+
FROM (
359
+
SELECT LANGUAGE(f.file_path, f.blob_content) as lang,
360
+
(ARRAY_LENGTH(SPLIT(f.blob_content, "\n"))-1) as lines
361
+
FROM ref_commits rc
362
+
NATURAL JOIN commits c
363
+
NATURAL JOIN commit_files cf
364
+
NATURAL JOIN files f
365
+
WHERErc.ref_name='HEAD'
366
+
ANDcf.file_path NOT REGEXP '^vendor.*'
367
+
AND NOT IS_BINARY(f.blob_content)
368
+
) t
369
+
GROUP BY lang
370
+
```
371
+
372
+
As a good rule of thumb: defer as much as possible GROUP BY and ORDER BY operations and only perform them with the minimum amount of data needed.
0 commit comments