-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
deflate/inflate implementations are ported rather than implemented from first principles #18062
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I would like to work on this, but I am absolutely a newbie in terms of the Zig project. It would be helpful if someone is willing to mentor me. |
What is this function doing here? I can't find it being referenced anywhere else. |
it's just copying memory awkwardly |
can't we just remove it? I mean, since we already have |
Yes, I think that a better implementation would not need this function. The current implementation relies on it because it isn't careful about aliasing. |
the best place for mentorship is one of the wonderful Communitys |
for anyone coming across this later, deflate is defined by https://datatracker.ietf.org/doc/html/rfc1951 for reference |
The initial goal of porting the deflate implementation from the Go code was to advance the project of the package manager by addressing the issue mentioned in #213. That being said, I'll be happy to answer questions for anyone wanting to remove it and replace it with a pure Zig implementation.
Every comment was faithfully copied (verbatim) from the Go code to facilitate tracking both sets of code easily and prevent any misinterpretation on my part.
It is utilized here: zig/lib/std/compress/deflate/compressor.zig Line 330 in ea4a077
It was my way to implement a copy() function that emulates the behavior of Go's built-in copy function (https://go.dev/ref/spec#Appending_and_copying_slices), ensuring copy from even when the source and destination memory overlaps.
|
I looked at the implementation and I feel like we could achieve the same thing with something like: // only copy the number of bytes required
const n = @min(dest.len, b.len);
std.mem.copyForwards(u8, dest[0..n], b[0..n]); The ported |
Also decompressor.zig claims to use up to 300KiB from provided allocator but that turned out to be a lie and png reader started failing for big files. I know original zlib claims not to use more than 35KiB of memory so there is room for improvement even there. If this is reimplemented I would suggest wrapping the passed in allocator with a FixedBufferAllocator of that maximum size so that tests can fails if you accidentally allocate more than that amount. |
Here is one implementation that hopefully ticks Andrew's requirements. I'm planning to convert that to the PR in the following days. |
It looks like great work! Thank you for taking the time to upstream it :-) |
Zig deflate compression/decompression implementation. It supports compression and decompression of gzip, zlib and raw deflate format. Fixes ziglang#18062. This PR replaces current compress/gzip and compress/zlib packages. Deflate package is renamed to flate. Flate is common name for deflate/inflate where deflate is compression and inflate decompression. There are breaking change. Methods signatures are changed because of removal of the allocator, and I also unified API for all three namespaces (flate, gzip, zlib). Currently I put old packages under v1 namespace they are still available as compress/v1/gzip, compress/v1/zlib, compress/v1/deflate. Idea is to give users of the current API little time to postpone analyzing what they had to change. Although that rises question when it is safe to remove that v1 namespace. Here is current API in the compress package: ```Zig // deflate fn compressor(allocator, writer, options) !Compressor(@typeof(writer)) fn Compressor(comptime WriterType) type fn decompressor(allocator, reader, null) !Decompressor(@typeof(reader)) fn Decompressor(comptime ReaderType: type) type // gzip fn compress(allocator, writer, options) !Compress(@typeof(writer)) fn Compress(comptime WriterType: type) type fn decompress(allocator, reader) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // zlib fn compressStream(allocator, writer, options) !CompressStream(@typeof(writer)) fn CompressStream(comptime WriterType: type) type fn decompressStream(allocator, reader) !DecompressStream(@typeof(reader)) fn DecompressStream(comptime ReaderType: type) type // xz fn decompress(allocator: Allocator, reader: anytype) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // lzma fn decompress(allocator, reader) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // lzma2 fn decompress(allocator, reader, writer !void // zstandard: fn DecompressStream(ReaderType, options) type fn decompressStream(allocator, reader) DecompressStream(@typeof(reader), .{}) struct decompress ``` The proposed naming convention: - Compressor/Decompressor for functions which return type, like Reader/Writer/GeneralPurposeAllocator - compressor/compressor for functions which are initializers for that type, like reader/writer/allocator - compress/decompress for one shot operations, accepts reader/writer pair, like read/write/alloc ```Zig /// Compress from reader and write compressed data to the writer. fn compress(reader: anytype, writer: anytype, options: Options) !void /// Create Compressor which outputs the writer. fn compressor(writer: anytype, options: Options) !Compressor(@typeof(writer)) /// Compressor type fn Compressor(comptime WriterType: type) type /// Decompress from reader and write plain data to the writer. fn decompress(reader: anytype, writer: anytype) !void /// Create Decompressor which reads from reader. fn decompressor(reader: anytype) Decompressor(@typeof(reader) /// Decompressor type fn Decompressor(comptime ReaderType: type) type ``` Comparing this implementation with the one we currently have in Zig's standard library (std). Std is roughly 1.2-1.4 times slower in decompression, and 1.1-1.2 times slower in compression. Compressed sizes are pretty much same in both cases. More resutls in [this](https://github.com/ianic/flate) repo. This library uses static allocations for all structures, doesn't require allocator. That makes sense especially for deflate where all structures, internal buffers are allocated to the full size. Little less for inflate where we std version uses less memory by not preallocating to theoretical max size array which are usually not fully used. For deflate this library allocates 395K while std 779K. For inflate this library allocates 74.5K while std around 36K. Inflate difference is because we here use 64K history instead of 32K in std. If merged existing usage of compress gzip/zlib/deflate need some changes. Here is example with necessary changes in comments: ```Zig const std = @import("std"); // To get this file: // wget -nc -O war_and_peace.txt https://www.gutenberg.org/ebooks/2600.txt.utf-8 const data = @embedfile("war_and_peace.txt"); pub fn main() !void { var gpa = std.heap.GeneralPurposeAllocator(.{}){}; defer std.debug.assert(gpa.deinit() == .ok); const allocator = gpa.allocator(); try oldDeflate(allocator); try new(std.compress.flate, allocator); try oldZlib(allocator); try new(std.compress.zlib, allocator); try oldGzip(allocator); try new(std.compress.gzip, allocator); } pub fn new(comptime pkg: type, allocator: std.mem.Allocator) !void { var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor var cmp = try pkg.compressor(buf.writer(), .{}); _ = try cmp.write(data); try cmp.finish(); var fbs = std.io.fixedBufferStream(buf.items); // Decompressor var dcp = pkg.decompressor(fbs.reader()); const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldDeflate(allocator: std.mem.Allocator) !void { const deflate = std.compress.v1.deflate; // Compressor var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Remove allocator // Rename deflate -> flate var cmp = try deflate.compressor(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.close(); // Rename to finish cmp.deinit(); // Remove // Decompressor var fbs = std.io.fixedBufferStream(buf.items); // Remove allocator and last param // Rename deflate -> flate // Remove try var dcp = try deflate.decompressor(allocator, fbs.reader(), null); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldZlib(allocator: std.mem.Allocator) !void { const zlib = std.compress.v1.zlib; var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor // Rename compressStream => compressor // Remove allocator var cmp = try zlib.compressStream(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.finish(); cmp.deinit(); // Remove var fbs = std.io.fixedBufferStream(buf.items); // Decompressor // decompressStream => decompressor // Remove allocator // Remove try var dcp = try zlib.decompressStream(allocator, fbs.reader()); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldGzip(allocator: std.mem.Allocator) !void { const gzip = std.compress.v1.gzip; var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor // Rename compress => compressor // Remove allocator var cmp = try gzip.compress(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.close(); // Rename to finisho cmp.deinit(); // Remove var fbs = std.io.fixedBufferStream(buf.items); // Decompressor // Rename decompress => decompressor // Remove allocator // Remove try var dcp = try gzip.decompress(allocator, fbs.reader()); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } ```
Zig deflate compression/decompression implementation. It supports compression and decompression of gzip, zlib and raw deflate format. Fixes ziglang#18062. This PR replaces current compress/gzip and compress/zlib packages. Deflate package is renamed to flate. Flate is common name for deflate/inflate where deflate is compression and inflate decompression. There are breaking change. Methods signatures are changed because of removal of the allocator, and I also unified API for all three namespaces (flate, gzip, zlib). Currently I put old packages under v1 namespace they are still available as compress/v1/gzip, compress/v1/zlib, compress/v1/deflate. Idea is to give users of the current API little time to postpone analyzing what they had to change. Although that rises question when it is safe to remove that v1 namespace. Here is current API in the compress package: ```Zig // deflate fn compressor(allocator, writer, options) !Compressor(@typeof(writer)) fn Compressor(comptime WriterType) type fn decompressor(allocator, reader, null) !Decompressor(@typeof(reader)) fn Decompressor(comptime ReaderType: type) type // gzip fn compress(allocator, writer, options) !Compress(@typeof(writer)) fn Compress(comptime WriterType: type) type fn decompress(allocator, reader) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // zlib fn compressStream(allocator, writer, options) !CompressStream(@typeof(writer)) fn CompressStream(comptime WriterType: type) type fn decompressStream(allocator, reader) !DecompressStream(@typeof(reader)) fn DecompressStream(comptime ReaderType: type) type // xz fn decompress(allocator: Allocator, reader: anytype) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // lzma fn decompress(allocator, reader) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // lzma2 fn decompress(allocator, reader, writer !void // zstandard: fn DecompressStream(ReaderType, options) type fn decompressStream(allocator, reader) DecompressStream(@typeof(reader), .{}) struct decompress ``` The proposed naming convention: - Compressor/Decompressor for functions which return type, like Reader/Writer/GeneralPurposeAllocator - compressor/compressor for functions which are initializers for that type, like reader/writer/allocator - compress/decompress for one shot operations, accepts reader/writer pair, like read/write/alloc ```Zig /// Compress from reader and write compressed data to the writer. fn compress(reader: anytype, writer: anytype, options: Options) !void /// Create Compressor which outputs the writer. fn compressor(writer: anytype, options: Options) !Compressor(@typeof(writer)) /// Compressor type fn Compressor(comptime WriterType: type) type /// Decompress from reader and write plain data to the writer. fn decompress(reader: anytype, writer: anytype) !void /// Create Decompressor which reads from reader. fn decompressor(reader: anytype) Decompressor(@typeof(reader) /// Decompressor type fn Decompressor(comptime ReaderType: type) type ``` Comparing this implementation with the one we currently have in Zig's standard library (std). Std is roughly 1.2-1.4 times slower in decompression, and 1.1-1.2 times slower in compression. Compressed sizes are pretty much same in both cases. More resutls in [this](https://github.com/ianic/flate) repo. This library uses static allocations for all structures, doesn't require allocator. That makes sense especially for deflate where all structures, internal buffers are allocated to the full size. Little less for inflate where we std version uses less memory by not preallocating to theoretical max size array which are usually not fully used. For deflate this library allocates 395K while std 779K. For inflate this library allocates 74.5K while std around 36K. Inflate difference is because we here use 64K history instead of 32K in std. If merged existing usage of compress gzip/zlib/deflate need some changes. Here is example with necessary changes in comments: ```Zig const std = @import("std"); // To get this file: // wget -nc -O war_and_peace.txt https://www.gutenberg.org/ebooks/2600.txt.utf-8 const data = @embedfile("war_and_peace.txt"); pub fn main() !void { var gpa = std.heap.GeneralPurposeAllocator(.{}){}; defer std.debug.assert(gpa.deinit() == .ok); const allocator = gpa.allocator(); try oldDeflate(allocator); try new(std.compress.flate, allocator); try oldZlib(allocator); try new(std.compress.zlib, allocator); try oldGzip(allocator); try new(std.compress.gzip, allocator); } pub fn new(comptime pkg: type, allocator: std.mem.Allocator) !void { var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor var cmp = try pkg.compressor(buf.writer(), .{}); _ = try cmp.write(data); try cmp.finish(); var fbs = std.io.fixedBufferStream(buf.items); // Decompressor var dcp = pkg.decompressor(fbs.reader()); const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldDeflate(allocator: std.mem.Allocator) !void { const deflate = std.compress.v1.deflate; // Compressor var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Remove allocator // Rename deflate -> flate var cmp = try deflate.compressor(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.close(); // Rename to finish cmp.deinit(); // Remove // Decompressor var fbs = std.io.fixedBufferStream(buf.items); // Remove allocator and last param // Rename deflate -> flate // Remove try var dcp = try deflate.decompressor(allocator, fbs.reader(), null); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldZlib(allocator: std.mem.Allocator) !void { const zlib = std.compress.v1.zlib; var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor // Rename compressStream => compressor // Remove allocator var cmp = try zlib.compressStream(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.finish(); cmp.deinit(); // Remove var fbs = std.io.fixedBufferStream(buf.items); // Decompressor // decompressStream => decompressor // Remove allocator // Remove try var dcp = try zlib.decompressStream(allocator, fbs.reader()); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldGzip(allocator: std.mem.Allocator) !void { const gzip = std.compress.v1.gzip; var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor // Rename compress => compressor // Remove allocator var cmp = try gzip.compress(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.close(); // Rename to finisho cmp.deinit(); // Remove var fbs = std.io.fixedBufferStream(buf.items); // Decompressor // Rename decompress => decompressor // Remove allocator // Remove try var dcp = try gzip.decompress(allocator, fbs.reader()); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } ```
Zig deflate compression/decompression implementation. It supports compression and decompression of gzip, zlib and raw deflate format. Fixes ziglang#18062. This PR replaces current compress/gzip and compress/zlib packages. Deflate package is renamed to flate. Flate is common name for deflate/inflate where deflate is compression and inflate decompression. There are breaking change. Methods signatures are changed because of removal of the allocator, and I also unified API for all three namespaces (flate, gzip, zlib). Currently I put old packages under v1 namespace they are still available as compress/v1/gzip, compress/v1/zlib, compress/v1/deflate. Idea is to give users of the current API little time to postpone analyzing what they had to change. Although that rises question when it is safe to remove that v1 namespace. Here is current API in the compress package: ```Zig // deflate fn compressor(allocator, writer, options) !Compressor(@typeof(writer)) fn Compressor(comptime WriterType) type fn decompressor(allocator, reader, null) !Decompressor(@typeof(reader)) fn Decompressor(comptime ReaderType: type) type // gzip fn compress(allocator, writer, options) !Compress(@typeof(writer)) fn Compress(comptime WriterType: type) type fn decompress(allocator, reader) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // zlib fn compressStream(allocator, writer, options) !CompressStream(@typeof(writer)) fn CompressStream(comptime WriterType: type) type fn decompressStream(allocator, reader) !DecompressStream(@typeof(reader)) fn DecompressStream(comptime ReaderType: type) type // xz fn decompress(allocator: Allocator, reader: anytype) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // lzma fn decompress(allocator, reader) !Decompress(@typeof(reader)) fn Decompress(comptime ReaderType: type) type // lzma2 fn decompress(allocator, reader, writer !void // zstandard: fn DecompressStream(ReaderType, options) type fn decompressStream(allocator, reader) DecompressStream(@typeof(reader), .{}) struct decompress ``` The proposed naming convention: - Compressor/Decompressor for functions which return type, like Reader/Writer/GeneralPurposeAllocator - compressor/compressor for functions which are initializers for that type, like reader/writer/allocator - compress/decompress for one shot operations, accepts reader/writer pair, like read/write/alloc ```Zig /// Compress from reader and write compressed data to the writer. fn compress(reader: anytype, writer: anytype, options: Options) !void /// Create Compressor which outputs the writer. fn compressor(writer: anytype, options: Options) !Compressor(@typeof(writer)) /// Compressor type fn Compressor(comptime WriterType: type) type /// Decompress from reader and write plain data to the writer. fn decompress(reader: anytype, writer: anytype) !void /// Create Decompressor which reads from reader. fn decompressor(reader: anytype) Decompressor(@typeof(reader) /// Decompressor type fn Decompressor(comptime ReaderType: type) type ``` Comparing this implementation with the one we currently have in Zig's standard library (std). Std is roughly 1.2-1.4 times slower in decompression, and 1.1-1.2 times slower in compression. Compressed sizes are pretty much same in both cases. More resutls in [this](https://github.com/ianic/flate) repo. This library uses static allocations for all structures, doesn't require allocator. That makes sense especially for deflate where all structures, internal buffers are allocated to the full size. Little less for inflate where we std version uses less memory by not preallocating to theoretical max size array which are usually not fully used. For deflate this library allocates 395K while std 779K. For inflate this library allocates 74.5K while std around 36K. Inflate difference is because we here use 64K history instead of 32K in std. If merged existing usage of compress gzip/zlib/deflate need some changes. Here is example with necessary changes in comments: ```Zig const std = @import("std"); // To get this file: // wget -nc -O war_and_peace.txt https://www.gutenberg.org/ebooks/2600.txt.utf-8 const data = @embedfile("war_and_peace.txt"); pub fn main() !void { var gpa = std.heap.GeneralPurposeAllocator(.{}){}; defer std.debug.assert(gpa.deinit() == .ok); const allocator = gpa.allocator(); try oldDeflate(allocator); try new(std.compress.flate, allocator); try oldZlib(allocator); try new(std.compress.zlib, allocator); try oldGzip(allocator); try new(std.compress.gzip, allocator); } pub fn new(comptime pkg: type, allocator: std.mem.Allocator) !void { var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor var cmp = try pkg.compressor(buf.writer(), .{}); _ = try cmp.write(data); try cmp.finish(); var fbs = std.io.fixedBufferStream(buf.items); // Decompressor var dcp = pkg.decompressor(fbs.reader()); const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldDeflate(allocator: std.mem.Allocator) !void { const deflate = std.compress.v1.deflate; // Compressor var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Remove allocator // Rename deflate -> flate var cmp = try deflate.compressor(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.close(); // Rename to finish cmp.deinit(); // Remove // Decompressor var fbs = std.io.fixedBufferStream(buf.items); // Remove allocator and last param // Rename deflate -> flate // Remove try var dcp = try deflate.decompressor(allocator, fbs.reader(), null); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldZlib(allocator: std.mem.Allocator) !void { const zlib = std.compress.v1.zlib; var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor // Rename compressStream => compressor // Remove allocator var cmp = try zlib.compressStream(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.finish(); cmp.deinit(); // Remove var fbs = std.io.fixedBufferStream(buf.items); // Decompressor // decompressStream => decompressor // Remove allocator // Remove try var dcp = try zlib.decompressStream(allocator, fbs.reader()); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } pub fn oldGzip(allocator: std.mem.Allocator) !void { const gzip = std.compress.v1.gzip; var buf = std.ArrayList(u8).init(allocator); defer buf.deinit(); // Compressor // Rename compress => compressor // Remove allocator var cmp = try gzip.compress(allocator, buf.writer(), .{}); _ = try cmp.write(data); try cmp.close(); // Rename to finisho cmp.deinit(); // Remove var fbs = std.io.fixedBufferStream(buf.items); // Decompressor // Rename decompress => decompressor // Remove allocator // Remove try var dcp = try gzip.decompress(allocator, fbs.reader()); defer dcp.deinit(); // Remove const plain = try dcp.reader().readAllAlloc(allocator, std.math.maxInt(usize)); defer allocator.free(plain); try std.testing.expectEqualSlices(u8, data, plain); } ```
Global variables (unacceptable; warrants the "bug" label):
zig/lib/std/compress/deflate/decompressor.zig
Line 255 in ea4a077
zig/lib/std/compress/deflate/decompressor.zig
Line 24 in ea4a077
What the hell is this comment doing here?
zig/lib/std/compress/deflate/decompressor.zig
Line 841 in ea4a077
Requiring an allocator:
zig/lib/std/compress/deflate/decompressor.zig
Line 76 in ea4a077
This implementation is so problematic precisely because it was ported. We can do better by writing these functions from first principles, using Zig's guiding principles and it will be more robust, optimal, and reusable.
In order to close this issue:
Related:
The text was updated successfully, but these errors were encountered: