As a result, your query could be orders of magnitude faster.
#### Squashed table optimizations
In squashed tables, data flows from the topmost table in the hierarchy down towards the rest of the tables. That way, if a squashed table is made of `repositories`, `commits` and `commit_files`, the process to generate the data is the following:
1. Get a repository. If there are no more repositories, finish.
2. If it satisfies the filters given to the `repositories` table, go to step 3; otherwise, go back to step 1.
3. Get the next commit for the current repository. If there are no more commits for this repository, go back to step 1.
4. If it satisfies the filters given to the `commits` table, go to step 5; otherwise, go back to step 3.
5. Get the next commit file for the current commit. If there are no more commit files for this commit, go back to step 3.
6. If it satisfies the filters given to the `commit_files` table, return the composed row; otherwise, go back to step 5.
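The loop above can be sketched as nested iteration, one level per table. This is only an illustration of the data flow, not gitbase's actual implementation; the function and field names are made up:

```python
def squashed_rows(repositories, repo_filter, commit_filter, file_filter):
    """Yield composed rows, applying each table's filters as data flows down."""
    for repo in repositories:                    # step 1: get a repository
        if not repo_filter(repo):                # step 2: filters on `repositories`
            continue
        for commit in repo["commits"]:           # step 3: next commit of this repo
            if not commit_filter(commit):        # step 4: filters on `commits`
                continue
            for path in commit["files"]:         # step 5: next file of this commit
                if not file_filter(path):        # step 6: filters on `commit_files`
                    continue
                yield (repo["id"], commit["hash"], path)

# Two repositories with 3 commits each, as in the example below.
repos = [
    {"id": "A", "commits": [{"hash": f"a{i}", "files": ["README.md"]} for i in range(3)]},
    {"id": "B", "commits": [{"hash": f"b{i}", "files": ["README.md"]} for i in range(3)]},
]

rows = list(squashed_rows(
    repos,
    repo_filter=lambda r: r["id"] == "A",   # filter applied as early as possible
    commit_filter=lambda c: True,
    file_filter=lambda p: True,
))
print(len(rows))  # 3
```

Because `repo_filter` rejects `B` at step 2, its commits and commit files are never generated at all.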
This way, the less data comes from the upper table, the less work the next table has to do, and thus the faster the query will be. A good rule of thumb is to apply a filter as early as possible. That is, if there is a filter by `repository_id`, it's better to write `repositories.repository_id = 'SOME_REPO'` than `commits.repository_id = 'SOME_REPO'`. Even though the result will be the same, it avoids a lot of useless computation for the repositories that do not satisfy the filter.
To illustrate this, let's consider the following example:
We have 2 repositories, `A` and `B`. Each repository has 3 commits.
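First, consider placing the filter on the `commits` table. The query below is an illustrative sketch (the column names are assumed from the gitbase schema, and the tables are joined on `repository_id` as all squashed tables are):

```sql
SELECT c.commit_hash
FROM repositories r
INNER JOIN commits c ON r.repository_id = c.repository_id
WHERE c.repository_id = 'A';
```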
With a query that filters on `commits.repository_id`, we will get the three commits from `A`.
But we have processed `B`'s commits as well, because the filter is applied in the `commits` table: both repositories make it to `commits`, which then generates 6 rows, 3 of which pass the filter, resulting in 3 rows.
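Alternatively, the filter can be moved to the `repositories` table. This is again an illustrative sketch under the same assumptions about the schema:

```sql
SELECT c.commit_hash
FROM repositories r
INNER JOIN commits c ON r.repository_id = c.repository_id
WHERE r.repository_id = 'A';
```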
With a query that filters on `repositories.repository_id`, we will also get the three commits from `A`.
This time, however, only 1 repository makes it past the filters in the `repositories` table and is sent to the `commits` table, which then generates just 3 rows, resulting in 3 rows.
The results are the same, but we have significantly reduced the amount of computation needed for this query. Now consider having 1000 repositories with 1M commits each. Both of these queries would produce 1M result rows, but the first one would compute 1B rows to get there, and the second only 1M.
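As a quick sanity check on those figures (the numbers are taken directly from the example above):

```python
repos, commits_per_repo = 1000, 1_000_000

# Filter applied in `commits`: every repository's commits are generated first.
rows_when_filtering_commits = repos * commits_per_repo

# Filter applied in `repositories`: only the one matching repo reaches `commits`.
rows_when_filtering_repositories = 1 * commits_per_repo

print(rows_when_filtering_commits)        # 1000000000 (1B)
print(rows_when_filtering_repositories)   # 1000000 (1M)
```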
This advice applies to filters on all squashed tables, not only filters on `repository_id`.
#### Limitations
**Only works per repository**. This optimisation is built on top of some premises; one of them is that all tables are joined by `repository_id`.