
Commit c60e3e6

docs: expand optimization guide on early filtering (#837)
2 parents e5179a1 + db7a8e4 commit c60e3e6

File tree

1 file changed: +37 -0 lines changed

docs/using-gitbase/optimize-queries.md

@@ -156,6 +156,43 @@ This has two advantages:

As a result, your query could be orders of magnitude faster.

#### Squashed table optimizations

In squashed tables, data flows from the topmost table in the hierarchy down to the rest of the tables. For example, if a squashed table is made of `repositories`, `commits` and `commit_files`, the process to generate the data is the following (an example query with a filter at every level is sketched after the steps):

1. Get a repository. If there are no more repositories, finish.
2. If it satisfies the filters given to the `repositories` table, go to step 3; otherwise, go to step 1 again.
3. Get the next commit for the current repository. If there are no more commits for this repository, go to step 1 again.
4. If it satisfies the filters given to the `commits` table, go to step 5; otherwise, go to step 3 again.
5. Get the next commit file for the current commit. If there are no more commit files for this commit, go to step 3 again.
6. If it satisfies the filters given to the `commit_files` table, return the composed row; otherwise, go to step 5 again.
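
For illustration, here is a sketch of a query over that hierarchy with a filter at every level; the column names `commit_author_name` and `file_path` are taken from the standard `commits` and `commit_files` schemas, and the filter values are placeholders:

```sql
-- Each condition is checked at the step of the generation process noted below.
SELECT * FROM repositories
NATURAL JOIN commits
NATURAL JOIN commit_files
WHERE repositories.repository_id = 'A'        -- checked at step 2
  AND commits.commit_author_name = 'someone'  -- checked at step 4
  AND commit_files.file_path = 'README.md'    -- checked at step 6
```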

This way, the less data that comes from the upper table, the less work the next table has to do, and therefore the faster the query will be. A good rule of thumb is to apply each filter as early as possible. That is, if there is a filter by `repository_id`, it's better to write `repositories.repository_id = 'SOME_REPO'` than `commits.repository_id = 'SOME_REPO'`: even though the result is the same, it avoids a lot of useless computation for the repositories that do not satisfy that filter.

To illustrate this, let's consider the following example:

We have 2 repositories, `A` and `B`. Each repository has 3 commits.

With this query, we will get the three commits from `A`.

```sql
SELECT * FROM repositories NATURAL JOIN commits WHERE commits.repository_id = 'A'
```

But we have processed `B`'s commits as well, because the filter is applied in `commits`. Both repositories make it to the `commits` table, which then generates 6 rows; 3 of them make it past the filter, resulting in 3 rows.

With this query, we will also get the three commits from `A`.

```sql
SELECT * FROM repositories NATURAL JOIN commits WHERE repositories.repository_id = 'A'
```

However, this time only 1 repository makes it past the filter in the `repositories` table and is sent to the `commits` table, which then generates just the 3 resulting rows.

The results are the same, but we have significantly reduced the amount of computation needed for this query. Now consider having 1000 repositories with 1M commits each. Both of these queries would return 1M rows, but the first one would compute 1B rows along the way, and the second only 1M.

This advice applies to all squashed tables and to any filtered column, not only `repository_id`.
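
For example, when squashing `commits` with `commit_files`, a filter on `commit_hash` (a column both tables expose) is better placed on `commits`. A hedged sketch, using `'SOME_HASH'` as a placeholder value:

```sql
-- Faster: the filter is applied in `commits`, the upper table, so only the
-- matching commit reaches `commit_files`.
SELECT * FROM commits NATURAL JOIN commit_files WHERE commits.commit_hash = 'SOME_HASH';

-- Slower: the filter is applied in `commit_files`, so the files of every commit
-- are generated before most of them are discarded.
SELECT * FROM commits NATURAL JOIN commit_files WHERE commit_files.commit_hash = 'SOME_HASH';
```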

#### Limitations

**Only works per repository**. This optimisation is built on top of some premises, one of which is the fact that all tables are joined by `repository_id`.
