
Commit 308ddd7

Add documentation on file formats. (#598)
1 parent f81c760 commit 308ddd7

dev-docs/file-formats.md

Lines changed: 127 additions & 0 deletions
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Designing file formats

## Use little JVM heap

Lucene generally prefers to avoid loading gigabytes of data into the JVM heap.
Could this data be stored in a file and accessed using a
`org.apache.lucene.store.RandomAccessInput` instead?

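For instance, here is a minimal sketch of this idea. The `OffHeapLongs` class
and its file layout are made up for illustration (this is not an actual Lucene
format): rather than deserializing a per-document `long[]` onto the heap, the
values stay on disk and are read through a `RandomAccessInput`.

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.RandomAccessInput;

// Hypothetical reader: the file name, offset and count would normally come
// from the format's metadata file.
final class OffHeapLongs implements AutoCloseable {
  private final IndexInput data;          // keeps the underlying file open
  private final RandomAccessInput values; // random-access view over the longs

  OffHeapLongs(Directory dir, String fileName, long offset, long count) throws IOException {
    data = dir.openInput(fileName, IOContext.DEFAULT);
    values = data.randomAccessSlice(offset, count * Long.BYTES);
  }

  /** Reads the index-th long directly from the file (page cache with MMapDirectory). */
  long get(long index) throws IOException {
    return values.readLong(index * Long.BYTES);
  }

  @Override
  public void close() throws IOException {
    data.close();
  }
}
```
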
## Avoid options

One of the hardest problems with file formats is maintaining backward
compatibility. Avoid giving options to the user, and instead let the file
format make decisions based on the information it has. If an expert user wants
to optimize for a specific case, they can write a custom codec and maintain it
on their own.

## How to split the data into files?

Most file formats split the data into 3 files:
- metadata,
- index data,
- raw data.

The metadata file contains all the data that is read once at open time. This
helps on several fronts:
- One can validate the checksums of this data at open time without significant
  overhead, since all of it needs to be read anyway; this helps detect
  corruptions early.
- There is no need to perform expensive seeks into the index/raw data files at
  open time: one can create slices into these files from offsets that have
  been written into the metadata file (see the sketch at the end of this
  section).

The index file contains data structures that help search the raw data. For KD
trees this would be the inner nodes, for doc values the jump tables, for KNN
vectors the HNSW graph structure, for terms the FST that stores term prefixes,
etc. Having it in a separate file from the data file enables users to do
things like `MMapDirectory#setPreload(boolean)` on these files, which are
generally rather small and accessed randomly. It is also convenient at times
because index and raw data can then be written on the fly without buffering
all of the index data in memory.

The raw file contains the data that needs to be retrieved.

Some file formats are simpler, e.g. the compound file format's index is so
small that it can be loaded fully into memory at open time. So it becomes
read-once data and can be stored in the same file as the metadata.

Some file formats are more complex, e.g. postings have multiple types of data
(docs, freqs, positions, offsets, payloads) that are optionally retrieved, so
they use multiple data files in order not to have to read lots of useless
data.

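Putting this together, here is a hedged sketch of what opening such a format
could look like. The file names (`foo.meta`, `foo.index`, `foo.data`), the
codec names, and the metadata layout are invented for illustration, and the
two-argument `Directory#openChecksumInput` matches older Lucene releases
(recent versions drop the `IOContext` parameter):

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.ChecksumIndexInput;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.IOUtils;

final class FooReader implements AutoCloseable {
  private final IndexInput indexFile, dataFile; // kept open until close()
  final IndexInput index;                       // slice over the search structures
  final RandomAccessInput data;                 // random-access view over the raw data

  FooReader(Directory dir) throws IOException {
    final long indexOffset, indexLength, dataOffset, dataLength;
    // Metadata: read once in full at open time, so its checksum is verified for free.
    try (ChecksumIndexInput meta = dir.openChecksumInput("foo.meta", IOContext.READONCE)) {
      CodecUtil.checkHeader(meta, "FooMeta", 0, 0);
      indexOffset = meta.readLong();
      indexLength = meta.readLong();
      dataOffset = meta.readLong();
      dataLength = meta.readLong();
      CodecUtil.checkFooter(meta);
    }

    // Index/raw data: no expensive seeks, just slices created from the offsets
    // that were stored in the metadata file.
    IndexInput indexIn = null, dataIn = null;
    boolean success = false;
    try {
      indexIn = dir.openInput("foo.index", IOContext.DEFAULT);
      dataIn = dir.openInput("foo.data", IOContext.DEFAULT);
      index = indexIn.slice("foo index", indexOffset, indexLength);
      data = dataIn.randomAccessSlice(dataOffset, dataLength);
      success = true;
    } finally {
      if (success == false) {
        IOUtils.closeWhileHandlingException(indexIn, dataIn);
      }
    }
    indexFile = indexIn;
    dataFile = dataIn;
  }

  @Override
  public void close() throws IOException {
    IOUtils.close(indexFile, dataFile);
  }
}
```
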
## Don't use too many files

The maximum number of file descriptors is usually not infinite. It's ok to use
multiple files per segment as described above, but this number should always be
small. For instance, it would be a bad practice to use a different file per
field.

## Add codec headers and footers to all files

Use `CodecUtil` to add headers and footers to all files of the index. This
helps make sure that we are opening the right file and differentiate Lucene
bugs from file corruptions.

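On the writing side this is typically just a matter of wrapping the payload
between a header and a footer. A hedged sketch, with a made-up file name,
codec name and version constant (`segmentId` and `segmentSuffix` would come
from the `SegmentWriteState`):

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

final class FooWriterSketch {
  static final int VERSION_CURRENT = 0; // made-up version constant

  static void writeMeta(Directory dir, byte[] segmentId, String segmentSuffix) throws IOException {
    try (IndexOutput meta = dir.createOutput("foo.meta", IOContext.DEFAULT)) {
      // Identifies the file format, its version, and the segment it belongs to.
      CodecUtil.writeIndexHeader(meta, "FooMeta", VERSION_CURRENT, segmentId, segmentSuffix);
      // ... write the actual metadata here ...
      // Appends a footer with a checksum so that readers can detect truncation.
      CodecUtil.writeFooter(meta);
    }
  }
}
```
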
## Validate checksums of the metadata file when opening a segment

If the data has been organized in such a way that the metadata file only
contains read-once data, then verifying its checksum is very cheap and helps
detect corruptions early. It also lets us give users a meaningful error message
that tells them that their index is corrupt, rather than a confusing exception
that says Lucene tried to read data beyond the end of the file, or anything
like that.

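`CodecUtil#checkFooter(ChecksumIndexInput, Throwable)` supports exactly this
pattern: if parsing fails because the bytes are corrupt, the checksum mismatch
gets reported instead of the confusing parse error. A hedged sketch, with the
same made-up names as above (and the same `openChecksumInput` caveat):

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.ChecksumIndexInput;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;

final class FooMetaCheckSketch {
  static void readMeta(Directory dir) throws IOException {
    try (ChecksumIndexInput meta = dir.openChecksumInput("foo.meta", IOContext.READONCE)) {
      Throwable priorE = null;
      try {
        CodecUtil.checkHeader(meta, "FooMeta", 0, 0);
        // ... parse the metadata ...
      } catch (Throwable exception) {
        priorE = exception;
      } finally {
        // If the checksum does not match, this throws a CorruptIndexException
        // that plainly says the index is corrupt; otherwise it rethrows priorE.
        CodecUtil.checkFooter(meta, priorE);
      }
    }
  }
}
```
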
## Validate structures of other files when opening a segment

One of the most frequent cases of index corruption that we have observed over
the years is file truncation. Verifying that index files have the expected
codec header and a correctly structured codec footer when opening a segment
helps detect a significant share of corruption cases.

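For files that are not read in full at open time, `CodecUtil#retrieveChecksum`
verifies that the file ends with a well-formed footer without checksumming the
whole file, which is enough to catch truncation. A hedged sketch with made-up
names:

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.IOUtils;

final class FooDataCheckSketch {
  static IndexInput openData(Directory dir, byte[] segmentId, String segmentSuffix) throws IOException {
    IndexInput data = dir.openInput("foo.data", IOContext.DEFAULT);
    boolean success = false;
    try {
      // Right file, right format, right segment?
      CodecUtil.checkIndexHeader(data, "FooData", 0, 0, segmentId, segmentSuffix);
      // Validates the structure of the footer (and thus catches truncation)
      // without reading and checksumming the entire file at open time.
      CodecUtil.retrieveChecksum(data);
      success = true;
      return data;
    } finally {
      if (success == false) {
        IOUtils.closeWhileHandlingException(data);
      }
    }
  }
}
```
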
## Do as many consistency checks as reasonable

It is common for some data to be redundant, e.g. data from the metadata file
might be redundant with information from `FieldInfos`, or all files from the
same file format should have the same version in their codec header. Checking
that these redundant pieces of information are consistent is always a good
idea, as it would make cases of corruption much easier to debug.

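For example, here is a hedged sketch of checking that two files of the same
(made-up) format carry the same version in their codec headers:

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.IndexInput;

final class FooConsistencySketch {
  // Both inputs are assumed to be positioned at the start of their codec header.
  static void checkSameVersion(IndexInput meta, IndexInput data,
                               byte[] segmentId, String segmentSuffix) throws IOException {
    int metaVersion = CodecUtil.checkIndexHeader(meta, "FooMeta", 0, 1, segmentId, segmentSuffix);
    int dataVersion = CodecUtil.checkIndexHeader(data, "FooData", 0, 1, segmentId, segmentSuffix);
    if (metaVersion != dataVersion) {
      // Redundant information that disagrees is a strong corruption signal.
      throw new CorruptIndexException(
          "Format versions mismatch: meta=" + metaVersion + ", data=" + dataVersion, data);
    }
  }
}
```
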
## Make sure to not leak files

Be paranoid regarding where exceptions might be thrown and make sure that files
get closed on all paths. E.g. if opening the data file fails while the index
file is already open, make sure that the index file also gets closed in that
case. Lucene has tests that randomly throw exceptions when interacting with the
`Directory` in order to detect such bugs, but it might take many runs before
randomization hits the exact case that triggers a bug.

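The usual defensive pattern (also used in the reader sketch above) is a
`success` flag combined with `IOUtils#closeWhileHandlingException`, so that
the original exception propagates and every file opened so far gets closed:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.IOUtils;

final class NoLeakSketch {
  static IndexInput[] openBoth(Directory dir) throws IOException {
    IndexInput index = null, data = null;
    boolean success = false;
    try {
      index = dir.openInput("foo.index", IOContext.DEFAULT);
      data = dir.openInput("foo.data", IOContext.DEFAULT); // may throw
      success = true;
      return new IndexInput[] {index, data};
    } finally {
      if (success == false) {
        // Closes whatever got opened before the failure; exceptions thrown
        // while closing are suppressed so that the original one propagates.
        IOUtils.closeWhileHandlingException(index, data);
      }
    }
  }
}
```
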
## Verify checksums upon merges

Merges need to read most if not all of the input data anyway, so make sure to
verify checksums before starting a merge by calling `checkIntegrity()` on the
file format reader, so that file corruptions don't get propagated by merges.
All default implementations do this.

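Inside a format reader, `checkIntegrity()` typically just checksums the data
file(s) in full. A hedged fragment, assuming the reader keeps its raw data
open in a `data` field:

```java
// Inside a hypothetical format reader that keeps its raw-data file open as `data`.
@Override
public void checkIntegrity() throws IOException {
  // Reads the whole file and verifies its checksum against the footer, so that
  // merges fail fast instead of silently propagating corrupt data.
  CodecUtil.checksumEntireFile(data);
}
```
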
## How to make backward-compatible changes to file formats?
See [here](../lucene/backward-codecs/README.md).
