Commit 6157207

filter-repo: add a --file-info-callback
This callback answers a common request users have to be able to operate on both file names and content, and throws in the mode while at it. It also makes our lint-history contrib example re-implementable as a few lines of shell script, but we'll leave it around anyway.

Signed-off-by: Elijah Newren <[email protected]>
1 parent 756edb6 commit 6157207

5 files changed, +445 -30 lines changed

Documentation/converting-from-filter-branch.md

+37-13
@@ -320,20 +320,44 @@ filter-branch:
320320
'
321321
```
322322

323-
filter-repo decided not to provide a way to run an external program to
324-
do filtering, because most filter-branch uses of this ability are
325-
riddled with [safety
326-
problems](https://git-scm.com/docs/git-filter-branch#SAFETY) and
327-
[performance
328-
issues](https://git-scm.com/docs/git-filter-branch#PERFORMANCE).
329-
However, in special cases like this it's fairly safe. One can write a
330-
script that uses filter-repo as a library to achieve this, while also
331-
gaining filter-repo's automatic handling of other concerns like
332-
rewriting commit IDs in commit messages or pruning commits that become
333-
empty. In fact, one of the [contrib
323+
though it has the disadvantage of running on every C file for every
324+
commit in history, even if some commits do not modify any C files. This
325+
means this kind of command can be excruciatingly slow.
326+
327+
The same functionality is slightly more involved in filter-repo for
328+
two reasons:
329+
- fast-export and fast-import split file contents and file names into
330+
completely different data structures that aren't normally available
331+
together
332+
- to run a program on a file, you'll need to write the contents to a
333+
file, execute the program on that file, and then read the contents
334+
of the file back in
335+
336+
```shell
337+
git filter-repo --file-info-callback '
338+
  if not filename.endswith(b".c"):
339+
    return (filename, mode, blob_id)  # no changes
340+
341+
  contents = value.get_contents_by_identifier(blob_id)
342+
  tmpfile = os.path.basename(filename)
343+
  with open(tmpfile, "wb") as f:
344+
    f.write(contents)
345+
  subprocess.check_call(["clang-format", "-style=file", "-i", tmpfile])
346+
  with open(tmpfile, "rb") as f:
347+
    contents = f.read()
348+
  new_blob_id = value.insert_file_with_contents(contents)
349+
350+
  return (filename, mode, new_blob_id)
351+
'
352+
```
353+
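The write, run, read-back round trip described in the second bullet can be tried on its own, outside of filter-repo. The following is only a sketch under stated assumptions: `run_tool_on_contents` and `UPPERCASE_IN_PLACE` are hypothetical names that are not part of filter-repo, and a small Python one-liner stands in for an in-place formatter such as `clang-format -i` so that the example runs without extra tools installed.

```python
import os
import subprocess
import sys
import tempfile


def run_tool_on_contents(contents, basename, argv):
    """Write contents to a scratch file, run argv + [path], return the new bytes.

    Hypothetical helper for illustration; not part of filter-repo.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # filter-repo passes filenames as bytes, so decode for the local path
        path = os.path.join(tmpdir, os.fsdecode(basename))
        with open(path, "wb") as f:
            f.write(contents)
        # check_call raises CalledProcessError if the tool exits non-zero
        subprocess.check_call(list(argv) + [path])
        with open(path, "rb") as f:
            return f.read()


# Stand-in for an in-place formatter: upper-cases the file it is given.
UPPERCASE_IN_PLACE = [
    sys.executable, "-c",
    "import sys, pathlib; p = pathlib.Path(sys.argv[1]); "
    "p.write_bytes(p.read_bytes().upper())",
]

result = run_tool_on_contents(b"int main;\n", b"demo.c", UPPERCASE_IN_PLACE)
print(result)
```

Using a temporary directory keeps scratch files out of the working directory. One trade-off to be aware of: a tool that looks for its configuration next to the input file (as `clang-format -style=file` does) may not find it from a scratch location, so the style would have to be supplied some other way.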
354+
However, one can write a script that uses filter-repo as a library to
355+
simplify this, while also gaining filter-repo's automatic handling of
356+
other concerns like rewriting commit IDs in commit messages or pruning
357+
commits that become empty. In fact, one of the [contrib
334358
demos](../contrib/filter-repo-demos),
335-
[lint-history](../contrib/filter-repo-demos/lint-history), handles
336-
this exact type of situation already:
359+
[lint-history](../contrib/filter-repo-demos/lint-history), was
360+
specifically written to make this kind of case really easy:
337361

338362
```shell
339363
lint-history --relevant 'return filename.endswith(b".c")' \

Documentation/git-filter-repo.txt

+112-6
@@ -288,6 +288,14 @@ Generic callback code snippets
288288
--refname-callback <function_body>::
289289
Python code body for processing refnames; see <<CALLBACKS>>.
290290

291+
--file-info-callback <function_body>::
292+
Python code body for processing the combination of filename, mode,
293+
and associated file contents; see <<CALLBACKS>>. Note that when
294+
--file-info-callback is specified, any replacements specified by
295+
--replace-text will not be automatically applied; instead, you
296+
have control within the --file-info-callback to choose which files
297+
to apply those transformations to.
298+
291299
--blob-callback <function_body>::
292300
Python code body for processing blob objects; see <<CALLBACKS>>.
293301

@@ -1164,8 +1172,9 @@ that you should be aware of before using them; see the "API BACKWARD
11641172
COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source
11651173
code.
11661174

1167-
All callback functions are of the same general format. For a command line
1168-
argument like
1175+
Most callback functions are of the same general format
1176+
(--file-info-callback is an exception which will be noted later). For
1177+
a command line argument like
11691178

11701179
--------------------------------------------------
11711180
--foo-callback 'BODY'
@@ -1209,6 +1218,7 @@ callbacks are:
12091218
--name-callback
12101219
--email-callback
12111220
--refname-callback
1221+
--file-info-callback
12121222
--------------------------------------------------
12131223

12141224
in each you are expected to simply return a new value based on the one
@@ -1272,10 +1282,106 @@ git-filter-repo --filename-callback '
12721282
'
12731283
--------------------------------------------------
12741284

1275-
In contrast, the blob, reset, tag, and commit callbacks are not
1276-
expected to return a value, but are instead expected to modify the
1277-
object passed in. Major fields for these objects are (subject to API
1278-
backward compatibility caveats mentioned previously):
1285+
The file-info callback is more involved. It is designed to be used in
1286+
cases where filtering depends on both filename and contents (and maybe
1287+
mode). It is called for file changes other than deletions (since
1288+
deletions have no file contents to operate on). The file info
1289+
callback takes four parameters (filename, mode, blob_id, and value),
1290+
and expects three to be returned (filename, mode, blob_id). The
1291+
filename is handled similarly to the filename callback; it can be used
1292+
to rename the file (or set to None to drop the change). The mode is a
1293+
simple bytestring (b"100644" for regular non-executable files,
1294+
b"100755" for executable files/scripts, b"120000" for symlinks, and
1295+
b"160000" for submodules). The blob_id is most useful in conjunction
1296+
with the value parameter. The value parameter is an instance of a
1297+
class that has the following functions
1298+
value.get_contents_by_identifier(blob_id) -> contents (bytestring)
1299+
value.get_size_by_identifier(blob_id) -> size_of_blob (int)
1300+
value.insert_file_with_contents(contents) -> blob_id
1301+
value.is_binary(contents) -> bool
1302+
value.apply_replace_text(contents) -> new_contents (bytestring)
1303+
and has the following member data you can write to
1304+
value.data (dict)
1305+
These functions allow you to get the contents of the file, or its
1306+
size, create a new file in the stream whose blob_id you can return,
1307+
check whether some given contents are binary (using the heuristic from
1308+
the grep(1) command), and apply the replacement rules from --replace-text
1309+
(note that --file-info-callback makes the changes from --replace-text not
1310+
auto-apply). You could use this for example to only apply the changes
1311+
from --replace-text to certain file types and simultaneously rename the
1312+
files it applies the changes to:
1313+
1314+
--------------------------------------------------
1315+
git-filter-repo --file-info-callback '
1316+
  if not filename.endswith(b".config"):
1317+
    # Make no changes to the file; return as-is
1318+
    return (filename, mode, blob_id)
1319+
1320+
  new_filename = filename[0:-7] + b".cfg"
1321+
1322+
  contents = value.get_contents_by_identifier(blob_id)
1323+
  new_contents = value.apply_replace_text(contents)
1324+
  new_blob_id = value.insert_file_with_contents(new_contents)
1325+
1326+
  return (new_filename, mode, new_blob_id)
1327+
--------------------------------------------------
1328+
1329+
Note that if history has multiple revisions with the same file
1330+
(e.g. it was cherry-picked to multiple branches or there were a number
1331+
of reverts), then the --file-info-callback will be called multiple
1332+
times. If you want to avoid processing the same file multiple times,
1333+
then you can stash transformation results in the value.data dict.
1334+
For example, we could modify the above example to make it only apply
1335+
transformations on blob_ids we have not seen before:
1336+
1337+
--------------------------------------------------
1338+
git-filter-repo --file-info-callback '
1339+
  if not filename.endswith(b".config"):
1340+
    # Make no changes to the file; return as-is
1341+
    return (filename, mode, blob_id)
1342+
1343+
  new_filename = filename[0:-7] + b".cfg"
1344+
1345+
  if blob_id in value.data:
1346+
    return (new_filename, mode, value.data[blob_id])
1347+
1348+
  contents = value.get_contents_by_identifier(blob_id)
1349+
  new_contents = value.apply_replace_text(contents)
1350+
  new_blob_id = value.insert_file_with_contents(new_contents)
1351+
  value.data[blob_id] = new_blob_id
1352+
1353+
  return (new_filename, mode, new_blob_id)
1354+
--------------------------------------------------
1355+
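A callback body like the caching one above can be exercised outside of filter-repo by wrapping it in a plain function and passing a stand-in for `value`. This is a minimal sketch, not filter-repo's implementation: `FakeValue`, its made-up blob ids, and its hard-coded replacement rule are all assumptions for illustration. Only the method names and the `data` dict follow the documentation text.

```python
class FakeValue:
    """Hypothetical stand-in for the `value` argument; not filter-repo's class."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)  # blob_id -> contents
        self.data = {}            # scratch dict the callback may write to
        self.lookups = 0          # counts how often contents are fetched

    def get_contents_by_identifier(self, blob_id):
        self.lookups += 1
        return self.blobs[blob_id]

    def insert_file_with_contents(self, contents):
        new_id = b"new-%d" % len(self.blobs)  # made-up id scheme
        self.blobs[new_id] = contents
        return new_id

    def apply_replace_text(self, contents):
        # Stand-in rule; the real method applies the --replace-text rules
        return contents.replace(b"secret", b"***REMOVED***")


def file_info_callback(filename, mode, blob_id, value):
    # Same body as the caching example, wrapped in a function
    if not filename.endswith(b".config"):
        return (filename, mode, blob_id)

    new_filename = filename[0:-7] + b".cfg"

    if blob_id in value.data:
        return (new_filename, mode, value.data[blob_id])

    contents = value.get_contents_by_identifier(blob_id)
    new_contents = value.apply_replace_text(contents)
    new_blob_id = value.insert_file_with_contents(new_contents)
    value.data[blob_id] = new_blob_id

    return (new_filename, mode, new_blob_id)


value = FakeValue({b"blob-1": b"key=secret\n"})
first = file_info_callback(b"app.config", b"100644", b"blob-1", value)
second = file_info_callback(b"copy/app.config", b"100644", b"blob-1", value)
untouched = file_info_callback(b"README", b"100644", b"blob-1", value)
```

The second call reuses the blob id stored in `value.data`, so the contents are fetched and transformed only once even though the same blob appears under two paths.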
1356+
An alternative example for the --file-info-callback is to make all
1357+
.sh files executable and add an extra trailing newline to the .sh
1358+
files:
1359+
1360+
--------------------------------------------------
1361+
git-filter-repo --file-info-callback '
1362+
  if not filename.endswith(b".sh"):
1363+
    # Make no changes to the file; return as-is
1364+
    return (filename, mode, blob_id)
1365+
1366+
  # There are only 4 valid modes in git:
1367+
  #   - 100644, for regular non-executable files
1368+
  #   - 100755, for executable files/scripts
1369+
  #   - 120000, for symlinks
1370+
  #   - 160000, for submodules
1371+
  new_mode = b"100755"
1372+
1373+
  contents = value.get_contents_by_identifier(blob_id)
1374+
  new_contents = contents + b"\n"
1375+
  new_blob_id = value.insert_file_with_contents(new_contents)
1376+
1377+
  return (filename, new_mode, new_blob_id)
1378+
--------------------------------------------------
1379+
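The example above changes every path ending in `.sh`, which would also match a symlink or submodule entry that happens to have such a name. For a symlink the blob holds the link target, so appending a newline would alter the target. A callback may therefore want to look at the incoming mode before changing anything. A minimal sketch of such a guard follows; `REGULAR_FILE_MODES` and `make_script_executable` are hypothetical names, not part of filter-repo.

```python
# Modes that denote ordinary files (as opposed to symlinks or submodules)
REGULAR_FILE_MODES = (b"100644", b"100755")


def make_script_executable(filename, mode):
    """Return b"100755" for regular .sh files; otherwise return mode unchanged."""
    if filename.endswith(b".sh") and mode in REGULAR_FILE_MODES:
        return b"100755"
    # Symlinks (120000), submodules (160000), and non-.sh files are left alone
    return mode


print(make_script_executable(b"run.sh", b"100644"))
print(make_script_executable(b"link.sh", b"120000"))
```

Inside a real callback the same check could be used to return `(filename, mode, blob_id)` unchanged whenever the mode is not one of the two regular-file modes.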
1380+
In contrast to the previous callback types, the blob, reset, tag, and
1381+
commit callbacks are not expected to return a value, but are instead
1382+
expected to modify the object passed in. Major fields for these
1383+
objects are (subject to API backward compatibility caveats mentioned
1384+
previously):
12791385

12801386
* Blob: `original_id` (original hash) and `data`
12811387
* Reset: `ref` (name of reference) and `from_ref` (hash or integer mark)
