Commit 921c8de

Merge pull request #19 from bobg/bobg/tree
The tree construction algorithm
2 parents aa1b312 + a39aa1c commit 921c8de

1 file changed, +84 -3 lines changed
spec.md

Lines changed: 84 additions & 3 deletions
@@ -65,7 +65,7 @@ We also use the following operators and functions:
 i.e. if $X = \langle X_0, \dots, X_N \rangle$ and $Y = \langle Y_0,
 \dots, Y_M \rangle$ then $X \mathbin{\|} Y = \langle X_0, \dots, X_N, Y_0, \dots, Y_M
 \rangle$
-- $\min(x, y)$ denotes the minimum of $x$ and $y$.
+- $\operatorname{min}(x, y)$ denotes the minimum of $x$ and $y$.

 # Splitting

@@ -92,7 +92,7 @@ The "split index" $I(X)$ of a sequence $X$ is either the smallest integer $i$ sa
 - $S_{\text{max}} \ge i \ge S_{\text{min}}$ and
 - $H(\langle X_{i-W}, \dots, X_{i-1} \rangle) \mod 2^T = 0$

-...or $\min(|X|, S_{\text{max}})$, if no such $i$ exists.
+...or $\operatorname{min}(|X|, S_{\text{max}})$, if no such $i$ exists.

 The “prefix” $P(X)$ of a non-empty sequence $X$ is $\langle X_0, \dots, X_{I(X)-1} \rangle$.
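
The split-index and prefix definitions in this hunk can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: `zlib.crc32` stands in for the hash $H$, the parameter values are arbitrary, and the bound $i \le |X|$ is added because the hunk's context line is truncated and the full list of conditions on $i$ is not visible here. The sketch also assumes `S_MIN >= W`, so the window $\langle X_{i-W}, \dots, X_{i-1} \rangle$ always exists.

```python
import zlib

# Illustrative parameter values (assumptions, not taken from the spec).
W, T = 16, 4             # window size in bytes, threshold in bits
S_MIN, S_MAX = 32, 256   # minimum and maximum chunk sizes; S_MIN >= W assumed


def H(window: bytes) -> int:
    """Stand-in 32-bit hash (assumption); the spec defines its own rolling hashes."""
    return zlib.crc32(window)


def split_index(x: bytes) -> int:
    """Smallest i in [S_MIN, S_MAX] (and at most len(x)) whose preceding
    W-byte window hashes to a multiple of 2**T, else min(len(x), S_MAX)."""
    for i in range(S_MIN, min(len(x), S_MAX) + 1):
        if H(x[i - W:i]) % (1 << T) == 0:
            return i
    return min(len(x), S_MAX)


def prefix(x: bytes) -> bytes:
    """P(X): the first I(X) bytes of x."""
    return x[:split_index(x)]
```

A production implementation would update the hash incrementally as the window slides rather than rehashing `W` bytes at every position; the sketch rehashes for clarity.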

@@ -105,7 +105,88 @@ We define $\operatorname{SPLIT}_C(X)$ recursively, as follows:

 # Tree Construction

-TODO
+If sequence $X$ and sequence $Y$ are largely the same,
+$\operatorname{SPLIT}_C$ will produce mostly the same chunks,
+choosing the same locations for chunk boundaries
+except in the vicinity of whatever differences there are
+between $X$ and $Y$.
+
+This has obvious benefits for storage and bandwidth,
+as the same chunks can represent both $X$ and $Y$ with few exceptions.
+But while only a small number of chunks may change,
+the _sequence_ of chunks may get totally rewritten,
+as when a difference exists near the beginning of $X$ and $Y$
+and all subsequent chunks have to “shift position” to the left or right.
+Representing the two different sequences may therefore require space
+that is linear in the size of $X$ and $Y$.
+
+We can do better,
+requiring space that is only _logarithmic_ in the size of $X$ and $Y$,
+by organizing the chunks in a tree whose shape,
+like the chunk boundaries themselves,
+is determined by the content of the input.
+The trees representing two slightly different versions of the same input
+will differ only in the subtrees in the vicinity of the differences.
+
+## Definitions
+
+The “hashval” $V(X)$ of a sequence $X$ is:
+
+$H(\langle X_{\operatorname{max}(0, |X|-W)}, \dots, X_{|X|-1} \rangle)$
+
+(i.e., the hash of the last $W$ bytes of $X$).
+
+The “level” $L(X)$ of a sequence $X$ is $Q - T$,
+where $Q$ is the largest integer such that
+
+- $Q \le 32$ and
+- $V(P(X)) \mod 2^Q = 0$
+
+(i.e., the level is the number of trailing zeroes in the rolling checksum in excess of the threshold needed to produce the prefix chunk $P(X)$).
+
+(Note:
+When $|R(X)| > 0$,
+$L(X)$ is non-negative,
+because $P(X)$ is defined in terms of a hash with $T$ trailing zeroes.
+But when $|R(X)| = 0$,
+that hash may have fewer than $T$ trailing zeroes,
+and so $L(X)$ may be negative.
+This makes no difference to the algorithm below, however.)
+
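
The hashval and level definitions above can be sketched in Python as follows. This is a minimal illustration, not the spec's implementation: `zlib.crc32` stands in for the hash $H$, the values of `W` and `T` are arbitrary, and `level` takes the prefix chunk $P(X)$ directly instead of computing it from $X$.

```python
import zlib

# Illustrative parameter values (assumptions, not taken from the spec).
W = 16   # window size in bytes
T = 4    # threshold in bits


def H(window: bytes) -> int:
    """Stand-in 32-bit hash (assumption); the spec defines its own rolling hashes."""
    return zlib.crc32(window)


def hashval(x: bytes) -> int:
    """V(X): the hash of the last W bytes of x (all of x if it is shorter)."""
    return H(x[max(0, len(x) - W):])


def trailing_zero_bits(v: int, cap: int = 32) -> int:
    """Largest Q <= cap such that v mod 2**Q == 0."""
    q = 0
    while q < cap and v % (1 << (q + 1)) == 0:
        q += 1
    return q


def level(prefix_chunk: bytes) -> int:
    """L(X) = Q - T, computed from the prefix chunk P(X).
    May be negative for a final chunk that ended without a hash match."""
    return trailing_zero_bits(hashval(prefix_chunk)) - T
```

The cap of 32 matters only when the hash value is zero, in which case every power of two divides it and $Q$ is taken to be 32.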
+A “node” in a hashsplit tree
+is a pair $(D, C)$
+where $D$ is the node’s “depth”
+and $C$ is a sequence of children.
+The children of a node at depth 0 are chunks
+(i.e., subsequences of the input).
+The children of a node at depth $D > 0$ are nodes at depth $D - 1$.
+
+The function $\operatorname{Children}(N)$ on a node $N = (D, C)$ produces $C$
+(the sequence of children).
+
+## Algorithm
+
+To compute a hashsplit tree from sequence $X$,
+compute its “root node” as follows.
+
+1. Let $N_0$ be $(0, \langle\rangle)$ (i.e., a node at depth 0 with no children).
+2. If $|X| = 0$, then:
+   a. Let $d$ be the largest depth such that $N_d$ exists.
+   b. If $|\operatorname{Children}(N_0)| > 0$, then:
+      i. For each integer $i$ in $[0 .. d]$, “close” $N_i$.
+      ii. Set $d \leftarrow d+1$.
+   c. [pruning] While $d > 0$ and $|\operatorname{Children}(N_d)| = 1$, set $d \leftarrow d-1$ (i.e., traverse from the prospective tree root downward until there is a node with more than one child).
+   d. **Terminate** with $N_d$ as the root node.
+3. Otherwise, set $N_0 \leftarrow (0, \operatorname{Children}(N_0) \mathbin{\|} \langle P(X) \rangle)$ (i.e., add $P(X)$ to the list of children in $N_0$).
+4. For each integer $i$ in $[0 .. L(X))$, “close” the node $N_i$ (see below).
+5. Set $X \leftarrow R(X)$.
+6. Go to step 2.
+
+To “close” a node $N_i$:
+
+1. If no $N_{i+1}$ exists yet, let $N_{i+1}$ be $(i+1, \langle\rangle)$ (i.e., a node at depth ${i + 1}$ with no children).
+2. Set $N_{i+1} \leftarrow (i+1, \operatorname{Children}(N_{i+1}) \mathbin{\|} \langle N_i \rangle)$ (i.e., add $N_i$ as a child to $N_{i+1}$).
+3. Let $N_i$ be $(i, \langle\rangle)$ (i.e., new node at depth $i$ with no children).
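
The tree construction above can be sketched in Python. This is a minimal illustration under stated assumptions, not the spec's reference implementation. The names `Node` and `build_tree` are hypothetical, and the function takes chunks already paired with their levels rather than splitting and hashing the input itself. The per-chunk loop follows steps 3 and 4 and the "close" procedure directly. The end-of-input handling is this sketch's own reading of steps 2a to 2d: it closes only open nodes that have children, and it prunes by descending into a sole child node, because after a close the open node $N_i$ is empty.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node (D, C): depth D and a list of children C (hypothetical name)."""
    depth: int
    children: list = field(default_factory=list)


def build_tree(chunks_with_levels):
    """Build a hashsplit tree from (chunk, level) pairs given in input order."""
    open_nodes = [Node(0)]  # open_nodes[i] plays the role of N_i

    def close(i):
        # "Close" N_i: append it to N_{i+1}, creating N_{i+1} if needed,
        # then start a fresh empty node at depth i.
        if i + 1 == len(open_nodes):
            open_nodes.append(Node(i + 1))
        open_nodes[i + 1].children.append(open_nodes[i])
        open_nodes[i] = Node(i)

    for chunk, level in chunks_with_levels:
        open_nodes[0].children.append(chunk)  # step 3
        for i in range(level):                # step 4; no-op when level <= 0
            close(i)

    # End of input (assumed reading of step 2): fold every non-empty open
    # node below the top into its parent, bottom-up.
    for i in range(len(open_nodes) - 1):
        if open_nodes[i].children:
            close(i)

    # Pruning (assumed reading of step 2c): skip roots with a single child node.
    root = open_nodes[-1]
    while root.depth > 0 and len(root.children) == 1:
        root = root.children[0]
    return root
```

For example, `build_tree([(b"a", 1), (b"b", 0)])` yields a depth-1 root with two depth-0 children, one per chunk, while a single chunk at level 0 yields a depth-0 root holding just that chunk.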

 # Rolling Hash Functions
