Skip to content

Implicit source positions #52

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 61 additions & 0 deletions rfcs/implicit-source-positions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# Implicit source positions

A new type of arrow is introduced. Syntactically, it looks like:

```ocaml
val f : loc:[%call_pos] -> ...
```

in a signature, and

```ocaml
let f = fun ~(loc : [%call_pos]) ... ->
...
```

in an implementation.

The label name doesn’t matter, but there must be a label. The function parameter thus defined has type `Lexing.position` and, if not specified at a call site, is automatically filled with the position of the call site.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find RFC quite unclear as to what positions you concretely get in the resulting Lexing.position values and if it is possible to specify multiple implicit source positions. Basically the notion of "position of the call site" is quite unclear to me.

E.g. if I have:

let f ~(loc0 : [%call_pos]) a0 ~(loc1 : [%call_pos]) a1 = …

And call

let () = f "hey" "ho" 

what do loc0 and loc1 point to in f. This:

let () = f "hey" "ho" 
           ^     ^
           |     |
           loc0  loc1

or something else ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think loc0 and loc1 will be equal and refer to the location of the whole expression f "hey" "ho".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what do I get if I do:

let g = f "hey" 
let () = g "ho"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would guess that loc0 will be the location of f "hey" and loc1 that of g "ho".

Copy link

@jberdine jberdine Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question I would have as a user of this feature is how the presence of an implicit source position parameter might affect code transformations the compiler might perform. For example, I would suppose with the example f above that f "hey" "ho" and (f "hey") "ho" might lead to different values of loc0 and loc1. This would make me worry that the implicit source position parameter might block something like contification. Such transformations aren't explained as part of the source language, so are probably out of scope for the RFC, but just as a data point, I would need to figure this out before using the feature.

Copy link
Contributor

@dbuenzli dbuenzli Mar 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if it's possible to access the location of the application here without fuss (if that's where the magic happens), I would rather suggest to change the terminology to locations (i.e. [%call_loc]) and perhaps simply type those as string * int * int * int to be understood as fname, line_num, start_byte_offset, end_byte_offset with offset relative to the first byte of line_num1; in other words the (multiline aware) format of OCaml error messages.

Having the whole range of the application enables quite a few more tricks :–)

Footnotes

  1. It's not the best format as you have to count lines in the source string to find the line first byte position, but it has the same memory footprint, if people care about that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming and the representation choice came from the fact that this is how the Lexing module presents the information, both in terms of content and name ("position"). I think we should consider a change to the stdlib if we are going to chance the specification of this feature similarly.

Copy link
Contributor

@dbuenzli dbuenzli Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm not complaining about the representation choice or information presentation, I'm complaining about the actual bits of information you report: one position versus two positions (a "text location").

In any case I'm not sure I see why a change to the stdlib is needed.

The whole idea of using the Lexing module (which is a runtime module for lexers generated by ocamllex) was a bit odd in the first place. For example the magic debugging variable confusingly named __POS_OF__ simply uses a structural type to report a location.

Copy link
Member

@gasche gasche Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: having two positions lets us denote a span/interval within the source, rather than just a point. This is the difference between Lexing.position, which is just one point, and Location.t which is an interval.

type t = Warnings.loc =
  { loc_start: position; loc_end: position; loc_ghost: bool }

I would also expect a source location/span/interval as opposed to a source position. (I have the impression that the choice of a source position goes back to ocaml/ocaml#126 .)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FTR Printexc does use a record type for reporting the same kind of information, see Printexc.location. Perhaps something to consider consolidating.

Note that the semantics is not better documented there though. There's a lot to unpack in the format of OCaml error messages –– namely byte start and end positions relative to the first byte of a line identified by one-based line number; with positions understood as defined in the introduction of the OCaml string module. It would be a good idea to once solve ocaml/ocaml#8978 and make all the mecanisms that report this kind of information refer to it.

This can serve several purposes:

- When chaining monadic binds, the usual stack backtraces are very noisy and less useful. You can replace `(>>=) : 'a t -> ('a -> 'b t) -> 'b t` with `(>>=) : loc:[%call_pos] -> 'a t -> ('a -> 'b t) -> 'b t` to automatically track binding locations (and then implement a tracing facility).
- It can be used to improve the usefulness of testing frameworks (see e.g. https://github.com/mirage/alcotest/pull/366) or tracing systems (such as meio[^1]), or error reporting in embedded languages.

## History

A first PR[^2] has been proposed ten years ago by @let-def. After some design amendments, that PR was stalled for a while, until renewed interested was expressed in 2023 by @TheLortex. The feature was then discussed during a developer meeting and gathered consensus on the principle[^3]. It was subsequently implemented at Jane Street and has been in use there for several months.

The goal of this RFC is therefore to present the details of the design in use at Jane Street to make sure there is agreement to port the feature upstream. Jane Street has asked Tarides for help in the summarizing work that led to this RFC.

## Design

### Typing

Implicit position arguments behave sort of like optional arguments: if not provided, some default value (the call site’s position) is passed. But their semantics is neither that of optional arguments, nor of the usual labelled arguments. So they are a new kind of function parameter (or, theoretically speaking, a new class of function arrow types).

At callsites, you can specify that an argument is `[%call_pos]` to help type inference infer the right type for a function. Usually this is only relevant in higher-order functions:

```ocaml
let wrap_function b f g ~here =
if b
then f ~(here : [%call_pos]) ()
else g ~(here : [%call_pos]) ()
```

### Controlling file paths in source positions

When implementing this feature, Jane Street also introduced a new compiler flag, `-directory`. When specified, `-directory some/path` causes the filename in automatically inserted positions to be prepended with `some/path`. Jane Street’s build system uses that flag to control the file paths reported by `[%call_pos]`. For example, this can be used to give more informative file paths than just a single `./filename.ml`, all the while ensuring a reproducible result and avoiding to include machine-specific paths like `/home/user/build/.../filename.ml` in the compiled program.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part seems orthogonal and indeed probably best left out of an initial PR. Also, note that in Dune, for example, all compilation invocations are made from the workspace root, so that the paths appearing in compiler locations are relative to the workspace root and not bare filenames.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We found that the feature was not very useful without this -directory bit, hence its inclusion here. But maybe our Dune is different from the upstream Dune in this way and that upstream Dune will work without -directory.


When discussing this flag with @MisterDA, he informed me that the compiler supports using the [`BUILD_PATH_PREFIX_MAP`](https://reproducible-builds.org/specs/build-path-prefix-map/) environment variable to achieve reproducible builds, and it looks like maybe the function achieved by `-directory` flag could be implemented with that, instead. But it’s not entirely clear how well `BUILD_PATH_PREFIX_MAP` is supported across the ecosystem, for example, whether Dune uses it or supports it. There also seems to be [some caveats](https://github.com/ocaml/ocaml/issues/8665) regarding absolute paths.



## References

- [PR on the Jane Street compiler](https://github.com/ocaml-flambda/flambda-backend/pull/2401)


[^1]: https://github.com/ocaml-multicore/meio
[^2]: https://github.com/ocaml/ocaml/pull/126
[^3]: https://github.com/ocaml/ocaml/pull/126#issuecomment-1582519924