Skip to content

Commit 2ab45a2

Browse files
authored
Merge pull request #19 from c-cube/rfc-io
modular IO
2 parents 1eea97a + 54f6cfc commit 2ab45a2

File tree

1 file changed

+248
-0
lines changed

1 file changed

+248
-0
lines changed

rfcs/modular_io.md

Lines changed: 248 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,248 @@
1+
# modular IO channels in the stdlib
2+
3+
This RFC proposes to update the stdlib's types `in_channel` and `out_channel` to make them user-definable and composable.
4+
5+
Even though a lot of networked applications use lwt or async to achieve high levels of concurrency, classic blocking IO still has its uses. However, the standard OCaml channels suffer from some painful limitations:
6+
7+
- they cannot be created outside the stdlib, which means the only thing we can manipulate through them is sockets and other unix file descriptors.
8+
- they cannot be composed. In other languages [such as Go](https://golang.org/pkg/io/#Reader), one can write reader or writer combinators which transform the bytestream
9+
that is written or read. Typical examples would include (de)compression and encryption.
10+
- there is some duplication in `Printf`; namely, the existence, and incompatibility, of `bprintf`, `sprintf`, and `fprintf`. This makes `Printf` printers artificially limited
11+
since their type determines what kind of output they can produce. In my own experience, `Format` is a better choice because a single `Format.formatter -> t -> unit` function
12+
can be used in more cases than any single `Printf` function, despite the overhead of formatters.
13+
- input channels provide an API that is unsuited to some forms of parsing such
14+
as reading line by line. The stdlib's `input_line` has to cheat and use
15+
a special C primitive to know how much input to consume (by looking for `'\n'`
16+
in the underlying buffer).
17+
18+
As a consequence, many libraries have their own opaque channel types that
19+
are not compatible with the standard ones. That's a missed opportunity for
20+
code reuse and composability.
21+
22+
## Proof of concept
23+
24+
A proof of concept is being developped at https://github.com/c-cube/poc-modular-io .
25+
26+
It features the improved interface from below, extensible `In.t` and `Out.t`
27+
types, and a bunch of combinators written against the *public* interface,
28+
including `read_line`, http1.1 chunked encoding, char-by-char mapping, etc.
29+
30+
## Extensibility
31+
32+
The current types are implemented in C and are opaque. I propose that it be changed for:
33+
34+
```ocaml
35+
type in_channel =
36+
| IC_raw of old_in_channel (* implemented in C *)
37+
| IC_user of {
38+
read: bytes -> int -> int -> int;
39+
close: unit -> unit;
40+
}
41+
42+
type out_channel =
43+
| OC_raw of old_out_channel (* implemented in C *)
44+
| OC_user of {
45+
write: bytes -> int -> int -> int;
46+
flush: unit -> unit;
47+
close: unit -> unit;
48+
}
49+
50+
(* now doable in userland: *)
51+
52+
val ic_of_string : string -> in_channel
53+
val unzip : in_channel -> in_channel
54+
55+
val oc_of_buf : Buffer.t -> out_channel
56+
val zip : out_channel -> out_channel
57+
val encrypt_rot13: out_channel -> out_channel
58+
59+
(* Write to both channels *)
60+
val tee : out_channel -> out_channel -> out_channel
61+
62+
```
63+
64+
**Aside**: In fact, I'm not sure why the channels are still implemented in C. I think the base case could be
65+
a raw Unix file descriptor and a `bytes` buffer — but maybe this is needed for portability.
66+
67+
This change would dramatically improve compositionality, as one could then:
68+
69+
- implement `Printf.{b,s}printf` in terms of `Printf.fprintf` (which would be
70+
the most general one of the three);
71+
- compose transformations on channels, such as encoding, http chunking, encryption,
72+
compression;
73+
- use APIs that only operate on `in_channel` with strings.
74+
75+
### Backward compatibility
76+
77+
Functions in `Unix` that map file descriptors to/from channels would become
78+
partial (i.e. they wouldn't work on user-defined channels). The `Unix.seek`
79+
function would not work on user defined channels, but it
80+
is already partial anyway (because of sockets).
81+
82+
A function `{in,out}_channel_has_descr : {in,out}_channel -> bool` would help
83+
know what channels correspond to unix file descriptors. Alternatively, **if**
84+
the sum type is made public, a mere pattern matching can do.
85+
86+
## Interface improvement for `in_channel`
87+
88+
The current interface of `in_channel` provides, roughly, `input : bytes -> int -> int -> int`
89+
which takes a byte slice and returns how many bytes were read, `0` indicating end of input.
90+
This interface doesn't expose the underlying buffer and instead imitates the lower level posix APIs.
91+
92+
The problem of this interface is that it makes some functions quite awkward to write,
93+
and hurts compositionality.
94+
An interesting alternative is [rust's `BufRead` interface](https://doc.rust-lang.org/std/io/trait.BufRead.html).
95+
In OCaml, that corresponds roughly to:
96+
97+
```ocaml
98+
module In_channel : sig
99+
type t
100+
101+
(** Obtain a slice of the current buffer. Empty iff EOF was reached *)
102+
val fill_buf : t -> (bytes * int * int)
103+
104+
(** Consume n bytes from the input. *)
105+
val consume : t -> int -> unit
106+
107+
(** Close channel and release resources *)
108+
val close : t -> unit
109+
end
110+
```
111+
112+
The semantics of these operations is:
113+
114+
- `fill_buf ic` ensures that the channel's internal buffer is non-empty, unless
115+
end-of-input was reached. Then, it exposes a view of the internal buffer.
116+
The important aspect of this is that successive calls to `fill_buf` return
117+
the same result; this doesn't consume input on a logical level. This function
118+
just exposes a slice of the input.
119+
- `consume ic n` eats `n` bytes of the input. It must only be called after
120+
`fill_buf` returned a slice of length at least `n`. If the whole slice
121+
exposed by `fill_buf` was consumed, then the channel will have to read more
122+
from its underlying stream at the next call to `fill_buf`.
123+
- `close ic` closes the channel and releases underlying resources.
124+
125+
### Advantages
126+
127+
This interface is easier to use than the current `input` interface, especially when
128+
parsing formats with non-trivial framing (e.g. http1.1). One typically wants
129+
to read a line to get headers and framing (content-length) information, followed
130+
by a read of `n` bytes. It is therefore important to read the line(s) efficiently
131+
but without consuming _too much_ from the input buffer as it's possibly part
132+
of the payload.
133+
134+
Compare the stdlib's [`input_line` implementation](https://github.com/ocaml/ocaml/blob/f333db8b0f176b1d75e6fdb46a97a78995426ed7/stdlib/stdlib.ml#L439)
135+
which uses a magical external to look inside the C buffer, with this snippet
136+
(adapted from [tiny httpd](https://github.com/c-cube/tiny_httpd/blob/3ac5510e2d5dfcdf448a03a99c0c178b73afeabd/src/Tiny_httpd.ml#L159)):
137+
138+
```ocaml
139+
let input_line (ic:In_channel.t) : String.t =
140+
let buf = Buffer.create 32 in
141+
let continue = ref true in
142+
while !continue do
143+
let s, i, len = In_channel.fill_buf ic in
144+
if len=0 then (
145+
continue := false;
146+
if Buffer.length buf = 0 then raise End_of_file;
147+
);
148+
let j = ref i in
149+
(* look for ['\n'] in the input buffer *)
150+
while !j < i+len && Bytes.get s !j <> '\n' do
151+
incr j
152+
done;
153+
if !j-i < len then (
154+
assert (Bytes.get s !j = '\n');
155+
Buffer.add_bytes buf s i (!j-i); (* without '\n' *)
156+
In_channel.consume ic (!j-i+1); (* consume rest of line + '\n' *)
157+
continue := false
158+
) else (
159+
Buffer.add_bytes buf s i len;
160+
In_channel consume ic len;
161+
)
162+
done;
163+
Buffer.contents buf
164+
```
165+
166+
In the stdlib implementation, the external
167+
[`input_scan_line`](https://github.com/ocaml/ocaml/blob/f333db8b0f176b1d75e6fdb46a97a78995426ed7/stdlib/stdlib.ml#L437)
168+
is used to peek inside the `in_channel`'s buffer, breaking the abstraction of `input`.
169+
This shows that peeking into the buffer without consuming it is quite necessary
170+
in practice.
171+
172+
### Compatibility
173+
174+
The current type of channels could retain its interface, for backward compatibility,
175+
in addition to the new interface which exposes `consume` and `fill_buf`,
176+
but implement `input`, in the general case, as follows
177+
(adapted from [tiny httpd](https://github.com/c-cube/tiny_httpd/blob/3ac5510e2d5dfcdf448a03a99c0c178b73afeabd/src/Tiny_httpd.ml#L146)):
178+
179+
```ocaml
180+
let input (ic:In_channel.t) buf i len : int =
181+
let offset = ref 0 in
182+
let continue = ref true in
183+
while continue && !offset < len do
184+
let s, j, n = In_channel.fill_buf ic () in
185+
let n_read = min n (len - !offset) in
186+
Bytes.blit s j buf (i + !offset) n_read;
187+
offset := !offset + n_read;
188+
In_channel.consume ic n_read;
189+
if n_read=0 then continue := false; (* eof *)
190+
done;
191+
!offset
192+
193+
```
194+
195+
In most cases this should only do one iteration if `n` is smaller than the
196+
underlying buffer's size.
197+
198+
**Alternatively**, this can be considered the implementation of `really_input`,
199+
and have input be just:
200+
201+
```ocaml
202+
let input (ic:In_channel.t) buf i len : int =
203+
let s, j, n = In_channel.fill_buf ic in
204+
let n_read = min n len in
205+
Bytes.blit s j buf i n_read;
206+
In_channel.consume ic n_read;
207+
n_read
208+
```
209+
210+
Here we see that the classic `input` is simply the successive application
211+
of `fill_buf` and `consume`.
212+
213+
214+
### Relation to the extensibility aspect
215+
216+
This is compatible with the extensibility approach from above. The new API
217+
would be:
218+
219+
```ocaml
220+
type in_channel =
221+
| IC_raw of old_in_channel (* implemented in C *)
222+
| IC_user of {
223+
fill_buf: unit -> (bytes * int * int);
224+
consume: int -> unit;
225+
close: unit -> unit;
226+
}
227+
```
228+
229+
### Out channel
230+
231+
There is no equivalent need to modify the interface of `out_channel`. Buffered
232+
output is simpler as one doesn't need to look inside the buffer at all.
233+
234+
### Summary
235+
236+
This change would improve the API of input channels, making them more flexible
237+
for some use cases that involve dynamic framing of input. Examples include http1
238+
(as well as its chunked encoding), the Redis protocol, and netstrings.
239+
240+
## Related interfaces
241+
242+
- Batteries has [BatIO](http://ocaml-batteries-team.github.io/batteries-included/hdoc2/BatIO.html)
243+
which also contains an extensible type for channels.
244+
However, it's not compatible with the stdlib's channels, so projects cannot
245+
export such channels in their APIs unless they force the batteries dependency.
246+
- ocamlnet has [Netchannels](http://projects.camlcity.org/projects/dl/ocamlnet-4.1.6/doc/html-main/Netchannels_tut.html)
247+
which is also extensible.
248+

0 commit comments

Comments
 (0)