-
Notifications
You must be signed in to change notification settings - Fork 36
Implicit source positions #52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Implicit source positions #52
Conversation
This feature has widespread use already within Jane Street. It was originally implemented by @vivianyyd, with much help getting it into production by @Enoumy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I am in favour. Incidentally, we had such a facility in LexiFi's compiler fork since 2008 and recently we removed it in order to reduce the diff with upstream. Will be happy to get it back as an official feature! :)
|
||
### Controlling file paths in source positions | ||
|
||
When implementing this feature, Jane Street also introduced a new compiler flag, `-directory`. When specified, `-directory some/path` causes the filename in automatically inserted positions to be prepended with `some/path`. Jane Street’s build system uses that flag to control the file paths reported by `[%call_pos]`. For example, this can be used to give more informative file paths than just a single `./filename.ml`, all the while ensuring a reproducible result and avoiding to include machine-specific paths like `/home/user/build/.../filename.ml` in the compiled program. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part seems orthogonal and indeed probably best left out of an initial PR. Also, note that in Dune, for example, all compilation invocations are made from the workspace root, so that the paths appearing in compiler locations are relative to the workspace root and not bare filenames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We found that the feature was not very useful without this -directory
bit, hence its inclusion here. But maybe our Dune is different from the upstream Dune in this way and that upstream Dune will work without -directory
.
As mentioned previously, there is broad maintainer consensus in favor of the feature. I suppose that the purpose of the RFC is more to discuss the proposed syntax and the practical details. The proposed syntax looks fine to me. The I think that we may need a PR that implements the language-level feature to answer these questions, unless you can use the Jane Street compiler with upstream dune to make tests and report on their behavior. |
Note that it's hard to answer these questions without fixing My gut feeling is that with a correctly working |
As mentioned, all Dune compiler invocations are done from the workspace root, so compiler paths are paths relative to the workspace root as well, so I think the answer to this question is "yes".
|
Thanks! This suggest that we could forget about the |
Yes, this does suggest we could drop the |
in an implementation. | ||
|
||
The label name doesn’t matter, but there must be a label. The function parameter thus defined has type `Lexing.position` and, if not specified at a call site, is automatically filled with the position of the call site. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find RFC quite unclear as to what positions you concretely get in the resulting Lexing.position
values and if it is possible to specify multiple implicit source positions. Basically the notion of "position of the call site" is quite unclear to me.
E.g. if I have:
let f ~(loc0 : [%call_pos]) a0 ~(loc1 : [%call_pos]) a1 = …
And call
let () = f "hey" "ho"
what do loc0
and loc1
point to in f
. This:
let () = f "hey" "ho"
^ ^
| |
loc0 loc1
or something else ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think loc0
and loc1
will be equal and refer to the location of the whole expression f "hey" "ho"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what do I get if I do:
let g = f "hey"
let () = g "ho"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would guess that loc0
will be the location of f "hey"
and loc1
that of g "ho"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question I would have as a user of this feature is how the presence of an implicit source position parameter might affect code transformations the compiler might perform. For example, I would suppose with the example f
above that f "hey" "ho"
and (f "hey") "ho"
might lead to different values of loc0
and loc1
. This would make me worry that the implicit source position parameter might block something like contification. Such transformations aren't explained as part of the source language, so are probably out of scope for the RFC, but just as a data point, I would need to figure this out before using the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that if it's possible to access the location of the application here without fuss (if that's where the magic happens), I would rather suggest to change the terminology to locations (i.e. [%call_loc]
) and perhaps simply type those as string * int * int * int
to be understood as fname, line_num, start_byte_offset, end_byte_offset
with offset relative to the first byte of line_num
1; in other words the (multiline aware) format of OCaml error messages.
Having the whole range of the application enables quite a few more tricks :–)
Footnotes
-
It's not the best format as you have to count lines in the source string to find the line first byte position, but it has the same memory footprint, if people care about that. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming and the representation choice came from the fact that this is how the Lexing
module presents the information, both in terms of content and name ("position"). I think we should consider a change to the stdlib if we are going to chance the specification of this feature similarly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I'm not complaining about the representation choice or information presentation, I'm complaining about the actual bits of information you report: one position versus two positions (a "text location").
In any case I'm not sure I see why a change to the stdlib is needed.
The whole idea of using the Lexing
module (which is a runtime module for lexers generated by ocamllex
) was a bit odd in the first place. For example the magic debugging variable confusingly named __POS_OF__
simply uses a structural type to report a location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify: having two positions lets us denote a span/interval within the source, rather than just a point. This is the difference between Lexing.position
, which is just one point, and Location.t
which is an interval.
type t = Warnings.loc =
{ loc_start: position; loc_end: position; loc_ghost: bool }
I would also expect a source location/span/interval as opposed to a source position. (I have the impression that the choice of a source position goes back to ocaml/ocaml#126 .)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FTR Printexc
does use a record type for reporting the same kind of information, see Printexc.location
. Perhaps something to consider consolidating.
Note that the semantics is not better documented there though. There's a lot to unpack in the format of OCaml error messages –– namely byte start and end positions relative to the first byte of a line identified by one-based line number; with positions understood as defined in the introduction of the OCaml string module. It would be a good idea to once solve ocaml/ocaml#8978 and make all the mecanisms that report this kind of information refer to it.
Conversation seems to have died down here, with the following results:
In types: e.g. val f : ?loc:[%call_pos] -> unit -> unit In function definitions: e.g. let f ?(loc = [%call_pos]) () = ()
(* or *)
let f = fun ?(loc = [%call_pos]) () -> () In function call sites: e.g. let () = f () (* omitted *)
let () = f ~loc:some_lexing_position () (* supplied *)
let () = let loc = some_lexing_position in f ~(loc : [%call_pos]) () (* supplied, with a hint to type inference that [loc] is a call-pos parameter, not a labeled one *)
let () = f ~loc:(some_lexing_position : [%call_pos]) () (* more direct syntax for the same thing *) These last two examples will be very rare in practice, useful only really in higher-order contexts where we can't know the type of Note also that it is never allowed to pass a call-pos parameter with
Does that summarize the state of play well here? Once this gets merged in, we at Jane Street will adjust our use of this feature to match the new syntax. Thanks! |
The summary is missing the discussion about:
Regarding 1. I think this should be implicit source locations which span the application and thus the syntax should rather be Regarding 2. I think it would be fine to reuse what has be used for |
I wouldn't say it's rarely used at call sites, you often use let primitive ?(limit = 3) x = …
let derived_combinator1 ?limit y = … primitive ?limit …
let derived_combinator2 ?limit y = … primitive ?limit … |
Yes, thanks for reminding me about this! I think that (2) determines (1) -- that is, the actual data captured should dictate the name. After asking around within Jane Street, there is considerable enthusiasm for keeping just the single position, rather than the whole span. We don't need the full span, and the extra information is likely to have a performance impact (this surprised me, too). So the thought here is to keep the existing It would not be hard to take the existing implementation of |
My intuition is that providing a location/span is likely to have various usability benefits in practice for some use-cases. (It's not even obvious what the 'position' of a function call is, I suppose we mean the starting position?) Given that this feature is precisely defined for usability, I would be inclined to prioritize this instead of performance. I wouldn't believe the performance argument without more explanations. Can you elaborate what the issue would be? Note that If performance is expected to be a concern, I would propose using an abstract type for the span/location, with accessors that return various information -- could start as just |
The runtime representation that Another take would also be to simple have Personally I'm not super fond of these ad-hoc primitives so I think we should rather aim for having a single one and one that allows to express the functionality of As far as naming goes, text locations vs position seems quite established in the compiler and beyond so I think simply having
I think I have already mentioned at least two times that this is a red herring, you can simply reuse what |
I'm quite interested in the proposed feature. I currently use `__FILE__`
and `__LINE__` for tracing-related purposes. I think @dbuenzli is right
to suggest the reuse of the existing types.
In terms of performance, just to make sure, the use of `__POS__` or `__POS_OF__` should not
allocate at runtime, correct? The compiler should put it into a static
variable?
```
/tmp $ /bin/cat truc.ml
let f x =
ignore (Sys.opaque_identity __POS__);
x
/tmp $ ocamlopt -dclambda truc.ml
clambda:
(seq
(let
(f/270
(closure
(fun camlTruc.f_270 1 x/272
(seq
(seq
(opaque "camlTruc.2"=block(0,"camlTruc.1"="truc.ml",3,30,37))
0)
x/272)) ))
(setfield_ptr(root-init) 0 (read_symbol camlTruc) f/270))
0)
camlTruc.1:
"truc.ml"
camlTruc.2:
block(0,"camlTruc.1"="truc.ml",3,30,37)
```
…--
Simon Cruanes
|
On the "performance" angle, there's really two issues:
Now, I don't expect that my "Conversion" point will be convincing if you think that spans are always preferable, but
As mentioned above, surfacing positions for logging/telemetry/debugging tools is definitely the main use case for this feature.
I like fwiw, I'm not suggesting that we ignore spans, I think it'd be nice to have both |
If all you do is format, unlikely. It will not be different from the transform performed by the code you linked below:
That's not a very good format. Unlike OCaml's error and stacktrace format, it will routinely fail to locate the right bit on sources with Unicode characters. Besides personally, in my program, I'd rather not have a different syntax for stacktraces and instrumentation statements – there's no need to overload my brain with different notations. So I wouldn't take that as an argument in favour of the information given by
I'm not against adding a type for locations as formatted by OCaml. Also as already mentioned there is one in
It absolutely does handle multiple lines. There seems to be misunderstandings about OCaml's error location and stacktrace reporting format which I explained in the discussion here and which is what |
Just to make that clear with a location (* This is the same function as Ppx_here.lift_position_as_string. *)
let make_location_string (pos_fname, pos_lnum, pos_first, _pos_last) =
String.concat
[ pos_fname; ":"; Int.to_string pos_lnum; ":"; Int.to_string pos_first ]
;; |
That's fine; you can always use
If I wanted to format a span for people to read and understand, then I'd expect (start-line start-col), (end-line, end-col), not whatever "pos_last" is.
If I'm reading that string, it's not obvious at all what span of code is covered by the "72" there. Maybe a tool could use that in a useful way, but as I've mentioned, the primary usecase here is for people to read.
Could you expand on this? It sounds like you're saying that building a
Looks like Oh also, neither do type error messages:
|
That's not the point, what I'm saying is that the link you gave on how you want to format things for users with a
What I said, and as shown by my snippet, is that there are going to be no conversion costs between locations as I want them and positions as you want them, as you claimed there would be in your point 2.
I does if you have a properly configured editor, that understands the semantics. Trying to count visual character in the Unicode world is a doomed endeavour, even counting Unicode scalar values won't do, you need feedback from the rendering layer. Talking in (relative as far as OCaml's format is concerned) bytes with the editor is the only reliable way. In that respect the GNU standard error format which expects you to count visual columns with Now the thing with OCaml's error information is that it is in fact a marvel of nice tradeoffs, these 3 integers – which again have the same memory footprint as a
|
Here is a hopefully fair summary of the debate between “point location” and “span location”. A full location span would be more general than a single position, however @TyOverby expresses concerns about the consequences in terms of binary size and conversion costs. @dbuenzli makes the point than using the same format as used by
takes up the same size as the currently used And indeed with a position encoded in this format, one can do everything that can be done with Suggestion: we could make the location type abstract with accessors as suggested by @gasche so that the actual representation is an implementation detail. For starters, we could implement it using the format described above, as it takes up no more space than the current proposal and allows to do the same things without cost, and also to derive a span. In the future, if needed, we can always switch to a more compact and flexible representation, such as the one used in native debug info (see ocaml/ocaml#10111). |
I'm convinced that we should favor usability over performance here. The performance costs that we are talking about are mostly minor (how large already are the js_of_ocaml outputs that we would be adding tens of thousands of words to?), and they can be further reduced by being clever. |
Maybe I’m already tired at this time of the afternoon, but it’s not entirely clear from your message what part of my suggestion you are objecting to, and what you propose to replace it with. |
Your summary is fine! I am clarifying my own opinion: after consideration and the further discussion, I am now convinced that we should use locations and not positions. |
Some further comments. You suggest to start with the same internal representation as POS_OF, and I think this is very reasonable. Deciding whether to hide the representation within an abstract type or not is not obvious:
Designing an abstract type for this is not obvious, in particular because for anything non-structural we need this type to be defined in the standard library -- not compiler-libs. Maybe an option would be to introduce a new type in the Maybe a dummy proposal to get things going: module Lexing.Location : sig
type t
val file : t -> string
val start_pos : t -> Lexing.position
val end_pos : t -> Lexing.position
val start_line : t -> int
val end_line : t -> int
val start_column: t -> int (** TODO document 0- or 1-indexing and that it is a byte offset *)
val end_column: t -> int
val start_offset : t -> int (** TODO document that it is a byte offset *)
val end_offset : t -> int
val to_POS_tuple : t -> (string * int * int * int) (** See {!__POS_OF__}. *)
end I am not sure that it is possible to implement Edit: the constructor for this abstract type is missing from this API sketch. There should be one. I suppose that the compiler would be synchronized with the internal implementation, and unsafely inject a structured constant matching this representation. |
@OlivierNicole it's not clear what you mean by the start position, but this is not what the OCaml error format does:
Both offsets are relative to the start of the line identified by the line number.
For humans. But for machines, if the process in charge of highlighting the location information is smart about it, it allows to overcome the problems posed by Unicode (by specifying the location precisely in the text encoding space rather than in the visual space). Regarding reusing the format of I did however wonder whether keeping the same format would allow to use
@gasche, note that if you want to keep the memory footprint of Regarding a suggestion, in a few projects now I have been using this for text locations which allows to recover all forms of error reporting (OCaml-like or broken GNU multi-lines) but it's a bit more heavy weight (6 integers, though they could be flatten in a single record). Something I would like to understand is why two more field in |
I'm not worried about having 6 integers, as one could probably pack them into three Int63 values. (I'm assuming it is reasonable to assume that these line numbers or byte offsets inside one single source file will each fit 31 bits.) If we use an abstract type, it would also be easy for people who maintain a fork of the compiler to choose a different internal representation that preserves less information, and return |
Another option would be to have a string with all these numbers packed in sequence one after the other using a simple unsigned variable byte encoding with a continuation bit. This is what is done for example in source maps (signed there apparently and wrapped in a base64 layer). The nice thing about the |
Aside, but if people want to look at some prior work on representations for positions and spans, it might be interesting to look at hack's Pos module and its dependencies. The size of these is significant for hack. There are files with ridiculously long lines, and some with ridiculously large numbers of lines. While the smallest packs a span into a single int, not every span can fit and so there are several representations and a compress operation that selects the smallest usable representation. |
True! But it still means that we have twice as many locations statically allocated in the binary as necessary -- and it requires extra boilerplate in functions to do the conversion. I'm probably more concerned about the latter (manually calling the conversion function) than the former. I agree that this seems to address the runtime performance concern. I still don't see what's so bad about having two features, catering to different use-cases. We have many integer types (each with its own literal) and two built-in collection types (each with its own literal). Instead, we could have just one integer type and one collection type and then ask programmers to convert -- that would be as expressive. But it's a poor experience, and so we provide multiple different types. That's all I'm arguing for here. |
My impression is that we would not be be providing a better experience to end-users by providing two almost-identical features with related (but not parametrized) syntaxes, but rather pushing complexity costs onto them (and onto future OCaml maintainers and contributions) to avoid making a decision now. I don't see a scenario where end-users would clearly benefit from having both options available (either they only want one position and either design is just fine, or they want a span and they cannot use My impression right now is that if we magically decide to only support locations/spans upstream (as I think was originally intended by let-def) and merge a reasonable implementation of that, then Jane Street will be just fine -- there are plenty of options, supporting both features in a transition period or forever, or keeping just your code as you like it, or proposing a refactoring to make it trivial for both features to share as much code as possible. I sort of wish I had been forceful about this three weeks ago when we realized the position-vs-location misunderstanding, we might have been done by now. |
I still don't see how this is different than with literals -- I could make the same argument around integer literals, keeping only the largest literal form. Those, too, are not parametrized. It would make me sad to see the conversation go the way you're suggesting. Up until now, all the surface-language features we have written, we have expected to be able to get, in some form, into the upstream language. The truth is that we don't want to maintain a difference forever, but the spans feature simply does not fit our needs. We can, of course, maintain our version. But it's surprising to me that our claims of "this won't work for us" seem not to suggest that it might also not work for others. After all, the goal here is to reduce boilerplate, and the suggestion you're making for this feature would fail to do that (as much as is possible) for our use-case. |
As far as I read you in this conversation the "this won't work for us" seems rather to be related to a data structure mismatch ( This would of course not apply to other users who have never been subjected to the feature. So I don't see any evidence so far "that it might not work for others" too. I also didn't see exactly what – except for the mismatch problem – in spans does not fit your need. Once you manage to make your Could you perhaps be more specific ? So far you have only shown this which is trivial to support with a span. |
Yes that's basically true. But I don't see how it invalidates my claim about others -- the Even if it were abstract, though (and thus could be more easily migrated to have a different representation) I think I'd still want the ability to talk about single positions, which presumably take up less space to encode. My motivation would be weaker, admittedly, but not gone entirely. It seems silly to require us to synthesize full spans throughout our programs if all we want are single locations. The difference I'm proposing between |
These people did not use implicit positions in their code… Anyways, I don't feel this discussion is productive so I'm going to stop here. I'm not the one making decisions anyways. But I'll just say that while I generally think Jane Street's work on the language is very beneficial for OCaml I'm not very fond of the way this particular RFC is being handled. The way I personally see it is that it is not being argued from an OCaml system design perspective, it is argued from an "avoid a technical debt for Jane Street" perspective. |
Just a bit of biteshedding here if I may:
Even if it were abstract, though (and thus could be more easily
migrated to have a different representation) I think I'd still want
the ability to talk about single positions, which presumably take up
less space to encode. My motivation would be weaker, admittedly, but
not gone entirely. It seems silly to require us to synthesize full
spans throughout our programs if all we want are single locations.
See, here you use locations to refer to positions. Sure you use single
locations which may slightly help to disambiguate, but not completely
IMHO, especially given that:
The difference I'm proposing between `[%call_pos]` and `[%call_loc]`
Here you use loc in contrast with pos to refer to spans.
Unless locations do mean spans to a lot of people, I'd tendto suggest
notto use that term, or to use it forpositions but then not to use pos
at all, well I guess what I need is that I feel it wouldn't hurt to
clarify this.
|
@shindere the whole system is confused by these terms and their semantics and, as this discussion has repeatedly shown, they are not even clear to people working on the compiler. Here's a glimpse on the confusion:
That being said if people find it would be an opportunity to clarify I wouldn't object to Footnotes
|
Thanks for what you share, @dbuenzli.
That being said if people find it would be an opportunity to clarify I
wouldn't object to `[%call_span]` and a corresponding `Textspan`
module[^1]
That would be my take. I mean, beyon the specific names you suggest, I
think it would be nice if we tried to make things as consistentas
possible.
[^1]: I remember pondering using `Cmarkit.Textspan` but eventually
stelled on `Cmarkit.Textloc` because the notion of `span` exists in
the commonmark language (code span, math span) and in HTML.
I would have thought they refer all to the same thing and that,
precisely, would be the reason for usingspan everywhere, because it's
widely understood as meaning text region?
Sorry if I missed something here.
… |
I would say that in the end it's not entirely clear. As your program may need vocabulary and types for:
Otherwise said: type textloc = …
type textspan = string * textloc If you have used |
Ah I think I understand what you mean. And of course for locations it's
a bit different because you do not have two things.
In the case of the compiler though my understanding was that there was
no real ambiguity that the only stored thing is the numbers delimiting
the span. But I can understand that, as far as users are concerned one
may want to have both.
If you think about it this is true for audio and video, too.
So maybe one could have a polymorphic span with a payload if one wants
to embody the associated content.
|
I apologize for using “location” where I should have used “position” (which, until this conversation, held the same meaning for me — the On to the substance: I am arguing for having two different ways of representing implicit source whatevers for two reasons:
My guess is that @gasche, @dbuenzli, and I will not be able to convince each other to change position (or location!) at this point. So I’m curious how others feel here. Please weigh in! If others here agree that having two different possibilities here would really harm the language, I could see myself retreating. But as of right now, I still think having two possibilities would make OCaml a better language. |
I am a bit wary of wading into the quicksand :) From my perspective, the discussion feels a bit "theoretical" in that (at least I) do not have a good picture of what are the potential uses of this feature that may need "locations" instead of "positions". More precise descriptions of these uses may be useful to move the discussion forward. For what is worth at LexiFi we carried for many years a version of this feature that used a "position" type and it worked very well for us (used exclusively for debugging and logging), but these were of course very specific uses and are not representative. I would not be shocked to have both variants in the language; it does not seem very different from |
I agree, it would also make sense to hope for separate locations for the function being called and for each argument. (The position of the application is presumably a join of that.) |
I became interested by this feature as a user and occasional maintainer of Alcotest, a testsuite for OCaml. Ideally, it would expose a function module Source_code_position : sig
type here = Lexing.position
(** Location information passed via a [~here] argument, intended for use with
a PPX such as {{:https://github.com/janestreet/ppx_here} [ppx_here]}. *)
type pos = string * int * int * int
(** Location information passed via a [~pos] argument, intended for use with
the [__POS__] macro provided by the standard library. See the
documentation of [__POS__] for more information. *)
end
type 'a extra_info =
?here:Source_code_position.here -> ?pos:Source_code_position.pos -> 'a
(** The assertion functions optionally take information about the {i location}
at which they are called in the source code. This is used for giving more
descriptive error messages in the case of failure. *)
val check : ('a testable -> string -> 'a -> 'a -> unit) extra_info I've seen in the wild arguments labeled Now, as the compiler doesn't (yet) insert source code {posi,loca}tions, I'm walking the callstack to find the first call outside of the library, which must be the site of the Take this dummy program that raises somewhere down the stack, and look at the printed backtrace: $ cat >test.ml <<EOF
let f () =
assert
false
let g () =
Sys.opaque_identity (f ())
let () =
Sys.opaque_identity (g ())
EOF
$ ocamlfind opt -g test.ml && OCAMLRUNPARAM=b ./camlprog.exe
Fatal error: exception Assert_failure("test.ml", 2, 4)
Raised at Test.f in file "test.ml", lines 2-5, characters 4-11
Called from Test.g in file "test.ml" (inlined), line 8, characters 24-30
Called from Test in file "test.ml", line 11, characters 24-30 If I understand correctly, at each point in the callstack the compiler has stored a range (location), except in the exception itself. I want to replace the problematic walk of the callstack (what if there's no debug info? inlining?) with the implicit source location, but I'd rather not have fewer information than what walking the callstack can give me. A bit orthogonal to the discussion, but the dream feature would be that Alcotest detects if it's running in a CI. If so, for instance with GitHub Actions, it can print annotations that the CI recognizes. One of them would allow the pretty-printer of the assertion failure to link back to the source code within the GitHub UI (see Workflow commands / Setting an error message). The GitHub workflow language supports code spans ({start,end}×{column,line}). I'd rather use this feature to its full extend. I'm also wondering how backward-compatibility will be achieved. If a library exposes |
We could indeed have a
I would personally be happy to provide my blessing to anyone motivated to do it, but I would not count on an "official" implementation as part of the upstream integration work. |
I don't think we should spend too much time on this: this is a language change, and we don't typically try to make new language features compatible with older compilers. |
@nojb said
Some have been provided here. In general as @MisterDA hinted it can be useful for precise application-level stacktraces for specific operations in your system. @gasche said
I think we are never going to get to something if we start factoring in this in :–) Note that with spans you can already trace your function argument locations modulo a call to a combinator or a high-precedence unary operator: trace_arg : ?loc:[%call_loc] -> 'a -> 'a * textloc
let ( !! ) = trace_arg as far as I'm concerned that would be good enough® – and in fact exactly what I would end up doing for my replacement of @goldfirere said
Note that doing this optimization would also entail for you not to depend on I think it would be worth nailing down a bit some of the possibilities. Let's say that a representation is:
I think either one or both of these two things could be done, each with a different variant (a or b).
A few comments:
That being said what would really upset me is to do only 1. or only "1. as a start" :–) |
Thanks @dbuenzli for that analysis -- very helpful. In talking about this all (with @OlivierNicole) I also came to the conclusion that we would have to pay back that technical debt one way or another. So I may spend part of my afternoon tackling that -- we'll see. But I'm more convinced than I was that I shouldn't worry about that detail. I'm also convinced that we should not just do (1). My argument at this point is simply that I don't want to just do (2) -- let's do both. I have a harder time choosing between (a) and (b) within (1) and (2). My instinct is that (b) is the right setting within (1). After all, if you want more than just to print a standard |
That reminds me: something that may not be clear from my terminology is that "lossy" implies likely inaccurate in the column, both for the machine and the human, if you have Unicode characters on the line.
Note that 2b can express 2a. So if we are smart about the way we devise the API for accessing 2a (instead of just using a concrete Also another thing to consider is that compact |
Well in fact lossy doesn't exist. 1b is is not lossy, it's indirect. If you compute So if you render that byte offset pretending to be a GNU |
So where do we sit here? It feels like there is support for 1b over 1a. And if we have an abstract interface for 2, then the details matter less there. Are we happy to have both 1 (of some form) and 2 (of some form)? Would it be helpful at this point to ask the committee to guide us to a decision? |
My preference is to do (2a) as a first step, while writing the code in such a way that it's easy for Jane Street to maintain (1) if they want and share most of the code. For example, one concrete way I proposed to do this is to first implement both (1) and (2) in the parsetree and below (I envision a construct that takes a variant type with two possible constructors), and then remove the code for (1) (same construct, same variant type, just one constructor left). In this direction (which I understand may not be unanimous) the main blocker is that I'm still not sure what API we want to expose for the type of program locations. I'm happy with the idea of exposing a If I wanted to converge quickly in this direction I would try to iterate on a concrete API, or maybe kindly ask @dbuenzli to come up with one and see what this gives (he is often the most demanding API critic, so starting from something he likes saves time). |
My understanding after discussing with @goldfirere is that Jane Street would rather use a language that is as close as possible as the OCaml that everyone else uses, and maybe one day the same language; so they would rather not use a non-exposed feature of the language. IIUC @dbuenzli and @gasche are against doing both 1a and 2a, because 2a contains strictly more information than 1a with no performance penalty, and they estimate that avoiding technical debt for Jane Street does not justify having two forms of position arguments. So maybe a decision that would make everyone slightly happier is to have both 1b and 2b as proposed by @goldfirere, if Jane Street can convince the other maintainers that 2b will not be performant enough for them, and if they don’t mind paying the work of switching away from Also noting that depending on which performance indicators should be optimized, |
My general reasoning so far has been that making the type of implicitly-inferred locations abstract can give us leeway to change things in the future, to move to richer representations and/or to more optimized representations. I think that any sensible representation would work in practice, and in particular a pair of Lexing.position would be just fine. But apparently we are unable to converge towards a consensus: Jane Street insists that we really must support exactly what they implemented because moving away to something more general is too difficult for them, and I don't like going away from the simplicity of exposing a single representation to users. Perhaps we could try, instead of having a single abstract type that strikes a good compromise, to have a GADT that lists several different types. We can then evolve the design by adding more representations later on, which is a user-visible change (people have to change their code to benefit from richer or more efficient representations) but perhaps something that we can compromise on. For example this could look like this: type _ implicit_arg =
| Pos : Lexing.position implicit_arg
| Loc : Textloc.t implicit_arg
(* maybe someday:
| Call_loc : Textloc.t list implicit_arg
*)
val f : ?loc:[%implicit_arg Loc] -> ...
let f ?(loc=[%implicit_arg Loc]) = ...
(* loc has type Textloc.t, derived from the type of Loc *) |
I have to say that I'm not super fond of @gasche, "let's not make a choice in the end" proposal. For me the problem is that we are in total conjecture limbo as far as (non technical debt) costs are concerned. For example I wouldn't be so sure to follow @gasche that a pair of lexing positions would be just fine. There are two main costs that are interesting to look at:
Isn't there a way to make a few real world measures on JaneStreet's executable to weigh the costs and make informed design decision ? I think the following would be useful:
Equiped with 1) we can compute precisely the storage costs of each variants. Informed by 2) we can adapt the representation for the common case, for example if we have a |
I appreciate the discussion here, but I'd like to push back slightly on
While I work for Jane Street, I argue here as me -- informed by my knowledge of what my colleagues want. I don't mean to insist on anything; instead, I'm advocating for allowing programmers to use different types for different use-cases (which does not sound so controversial to me). Furthermore, I've agreed that the new syntax proposed in this discussion is better than our currently-implemented one, and I said up-thread that I would explore further the feasibility of changing our representation (indeed, my advocacy for 1b above would require that change of representation). So I think we're all showing evidence of having a give-and-take conversation of what would be best here. Now, on to technical matters: I don't love @gasche's new proposal. It definitely would work for Jane Street -- but I think it's a worse design. It exhibits the modest complexity of having multiple different source indicator types, but has extra complexity in the encoding. This design might be good if we expected to have heterogeneous collections of source information, but I doubt we would want that. @dbuenzli: I appreciate these questions, but answering them is hard. I don't see a good way of assessing this without instrumenting the compiler to count the occurrences and then somehow working with the build system to accumulate the results somehow. Or maybe it would just print out the result and then we'd have some parser to extract those results from the output? Something like that would probably bork our build system's expectations, though. And we have many different executables... which to test? Similarly to figure out how the information is used, among 65+ million lines of OCaml code. I'm not saying these questions are impossible to answer -- just that it would take real time to produce answers, and it's not clear to me yet that I should ask Jane Street to invest those resources. (For example: what answers could I have to these questions that would change the debate?) Instead, I asked a few colleagues about this, and the general consensus was that it's mostly used for printing errors (as you would expect). The received wisdom was also that changing away from The other datum I have is that |
("argue here as me": noted, apologies.)
I'm not sure what you mean by "extra complexity in the encoding". In particular if you meant that we would need bits at runtime to distinguish among the different cases, then I would disagree: the GADT is compile-time information for the compiler to decide which value (a
For space, it may be possible to write a tool that looks at the constant sections of your binaries, and counts the number of constant values that look like locations. You would get a sense of the portion of binary size taken by location information, and the result would probably be that it is neglectible. For time, the easiest way I can think of is to write a function that does a shallow copy of |
Ok no problem. I thought perhaps you had an easy way to test experimental compilers on a build of your code base (and thus just modify the compiler to increment a counter in a global file each time the implicit location writing code path gets hit).
It would be interesting to investigate when these |
Rendered version
Introduction of the RFC: