Skip to content

Exception stack traces on macOS/iOS are in a bad state #4895

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
s-ludwig opened this issue Mar 29, 2025 · 40 comments
Open

Exception stack traces on macOS/iOS are in a bad state #4895

s-ludwig opened this issue Mar 29, 2025 · 40 comments

Comments

@s-ludwig
Copy link
Contributor

At least since the change to use "atos" for resolving stack frames, the state of printing call stacks is pretty disastrous on macOS, regardless of the -g switch:

  1. Printing an exception stack trace will freeze the calling thread for an extended period of time (can be a few hundred ms), which means that a process that logs/prints a larger number of exceptions can easily become unusable or would be a very easy DDoS target.

  2. For x86-64 builds in particular, the stack unwinding functionality appears to be broken and only the topmost stack frame is being displayed.

  3. Resolving file names and line numbers doesn't seem to work at all.

  4. Stack traces generated for system crash reports usually stop at the first extern (D) function (I'm not sure whether this might be a simple issue of --frame-pointer, I'll have to check whether that makes a difference)

  5. On some systems atos isn't available at all

In comparison, dmd appears to work fine (stack traces with symbols working regardless and file/line number working correctly when passing -g.

This has been an issue that wanted to dig deeper into for a while now, but never really got to it, so I just wanted to at least get this written down to maybe start a discussion.

@kinke
Copy link
Member

kinke commented Mar 29, 2025

This can hardly be the general case, as macOS CI includes backtrace tests and is happy on both x86_64 and arm64. So at least some aspects of this seem specific to your setup.

Are you talking about unoptimized -g code, or optimized builds incl. -O? In the latter case, the missing frame pointers might be a problem (not just in your code, but druntime/Phobos too, unless using -link-defaultlib-debug). But those are going to be kept by default for Apple targets with the next version (#4889).

The atos-based implementation has been upstreamed, but DMD's druntime most likely doesn't use it because it probably still prevents the linker from stripping the __debug_line section (which we did before LDC v1.31 by hacking LLVM, before the atos-based implementation was merged in #4291).

Are you talking about distributed binaries, where e.g. the .dSYM files might be missing? [Note that I don't have any Apple stuff and cannot test anything myself.] Ref: #4006

@s-ludwig
Copy link
Contributor Author

I've tested with and without -g and with and without optimizations. It's very well possible that there are some system dependent factors here. I'll try to summarize what I got here:

Snippet used for testing:

void foo(int i)
{
	if (i > 0)
		foo(i - 1);
	else assert(false);
}

void bar()
{
	foo(4);
}

void main()
{
	bar();
}

macOS 15.3.1 arm64, latest Xcode, LDC 1.40.1 - same result for ldc2, ldc2 -g, ldc2 -d --frame-pointer=all:

[email protected](5): Assertion failure
----------------
??:? object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) [0x10472b103]
??:? void test.foo(int) [0x1046eb773]
??:? void test.foo(int) [0x1046eb773]
??:? void test.foo(int) [0x1046eb773]
??:? void test.foo(int) [0x1046eb773]
??:? void test.bar() [0x1046eb7bb]
??:? _Dmain [0x1046eb7cf]

With ldc2 -g --frame-pointer=all -O:

[email protected](5): Assertion failure
----------------
??:? object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) [0x102f63103]
??:? void rt.dmain2._d_run_main2(char[][], ulong, extern (C) int function(char[][])*).runAll() [0x102f6b893]
??:? _d_run_main [0x102f6b5bf]
??:? start [0x189f70273]
??:? .constarray [0xffffffffffffffff]

Compiling for x64 (ldc2 -g --frame-pointer=all -march=x86-64 -run test.d):

[email protected](5): Assertion failure
----------------
??:? object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) [0x10015d4d2]

I'm also getting the same result when compiling natively on an Intel Mac, both for macOS 15.3 and macOS 10.15.

Running dsymutil on the executable did fix line numbers for the arm64 build, but not for the x64 one.

However, compiling with LDC 1.31.0 for x64 results in a stack trace comparable to arm64:

[email protected](5): Assertion failure
----------------
??:? _d_assert [0x104d091bf]
??:? void test.foo(int) [0x104b8ba6d]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.bar() [0x104b8ba8d]
??:? _Dmain [0x104b8ba98]

Interestingly, for LDC 1.31.0 on x64, line numbers are getting displayed correctly after dsymutil has been run - even after deleting the test.dSYM folder and rebuilding the executable with a different name and changed code... but not when using -run. Seems like dsymutil stores/caches something somewhere else, too?!

[email protected](5): Assertion failure
----------------
??:? _d_assert [0x10485a1bf]
/Users/build/test.d:5 void test.foo(int) [0x1046dca6d]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:10 void test.bar() [0x1046dca8d]
/Users/build/test.d:15 _Dmain [0x1046dca98]

The excessive CPU use of the atos invocations seems to be consistent across all systems, developer machine or not.

@kinke
Copy link
Member

kinke commented Mar 30, 2025

As said above, you need an additional -link-defaultlib-debug to make sure druntime & Phobos keep the frame pointers.

Then my understanding is that Apple's ld64 linker at one point decided not to keep the __debug_line section in the executable anymore (without a switch to keep it AFAIK - as that would resolve all of your problems, incl. runtime performance - atos is only used as fallback), but to reference that DWARF data from the linked object files: https://stackoverflow.com/questions/10044697/where-how-does-apples-gcc-store-dwarf-inside-an-executable/12827463#12827463

With -run, the object file is temporary and removed right after linking. So when running the executable, the DWARF data isn't available anymore. So please don't use -run for these tests here.

AFAIU, running dsymutil extracts the debug infos from the object files to .dSYM files, so that they can be distributed with the executables. Similar to .pdb files on Windows. And atos then takes care of processing these files for us. Ideally, we'd have an in-process library solution as on Windows (dbghelp.dll), but no idea what Apple provides.

@JohanEngelen
Copy link
Member

JohanEngelen commented Mar 30, 2025

Just now I did some testing on my macbook (macOS Sonoma 14.6.1 arm64), with LDC 1.40.1 and got different results for the testsnippet given above.
ldc2(i.e. compile to executable separately from running that executable) --> no file or line numbers, as expected because there is no debug info.
ldc2 -g --> file and line numbers as expected
ldc2 -g --frame-pointer=all --> no delta with -g (as expected, I think)
ldc2 -g --frame-pointer=all -run foo.d --> no line numbers.
ldc2 -g -O --> No file or line numbers as expected, because the code is too simply optimized to nop, and then the stack trace is purely druntime with no debug info linked in.
ldc2 -g -O -frame-pointer=all --> No file or line numbers as expected. Framepointer=all does not prevent inlining/elision etc. It only prevents frame pointer elision in functions that managed to survive other optimizations (possibly helping with stack variable detection), I was not expecting it to help with call stack tracing.

All results are as expected except for no line number with -run. There was no need to run dsymutil for these tests. (although the point #4006 still stands)

The first post contains a number of separate items without clear testcase or actionable items, so I don't know what to do with those. I never had issues with atos taking too long, but also don't really do a lot of programming / testing.

@s-ludwig
Copy link
Contributor Author

As said above, you need an additional -link-defaultlib-debug to make sure druntime & Phobos keep the frame pointers.

The only part from Druntime should be _d_assert here, so I don't think this should be relevant in this case?

W.r.t. -run, I am now getting line numbers for the arm64 build when compiling normally, so maybe that was actually okay all along and there is something else going on in our build setup.

However, the x86-64 build is only showing a single frame, regardless of -run - @JohanEngelen, can you try what happens for you with -march=x86-64?

Regarding atos, calling Throwable.toString() on Windows with a very short call stack takes about 160 us, whereas on macOS the same will take about 430 ms - so it's over 2500 times slower and even logging a single exception is out of the question for many applications. At least on macOS, there is the backtrace/backtrace_symbols API that AFAIR worked fine in my tests a while ago, could that be an alternative?

@kinke
Copy link
Member

kinke commented Mar 30, 2025

Btw guys, what is that .dSYM bundle? From the linked stackoverflow post:

dsymutil can be thought of as a debug info linker. It does this same process -- read the debug map, load the DWARF from the .o files, relocate all the addresses -- and then outputs a single binary of all the DWARF at their final, linked addresses. This is the dSYM bundle.

Once you have a dSYM bundle, you've got plain old standard DWARF that any dwarf reading tool (which can deal with Mach-O binary files) can process.

Does this mean that running dsymutil on a binary fattens it up with DWARF data, in-place? So no extra files to bundle as I thought, but bloating the executable with the debuginfos, so that druntime doesn't have to resort to atos (https://github.com/dlang/dmd/blob/6656cd920cefebff836aaf70fbf4c570f908e959/druntime/src/core/internal/backtrace/dwarf.d#L247-L251)?

@s-ludwig
Copy link
Contributor Author

From what I can see, the original binary stays untouched. The dSYM output is a special folder (bundle) that contains one or more files with DWARF debug information, as well as separate address relocation data as YAML files.

@s-ludwig
Copy link
Contributor Author

At least on macOS, there is the backtrace/backtrace_symbols API that AFAIR worked fine in my tests a while ago, could that be an alternative?

Unfortunately, backtrace_symbols doesn't return line numbers, so scratch that.

@kinke
Copy link
Member

kinke commented Mar 30, 2025

At least on macOS, there is the backtrace/backtrace_symbols API that AFAIR worked fine in my tests a while ago, could that be an alternative?

No, atos is specifically only used to resolve the file/line infos in case the executable doesn't have the section directly (which DMD enforces by setting an unconventional flag; something we can't do in LLVM without hacking it, as far as we know). We don't need it for the rest, i.e., druntime already uses backtrace[_symbols].

If the implementation can be sped up, that should be done upstream, even if DMD can work around this performance issue.

From what I can see, the original binary stays untouched.

Ah too bad, so we do need atos or some other stuff handling the debuginfo reading for us - either from the executable's map and the linked object files, or these extra .dSYM bundles.

@s-ludwig
Copy link
Contributor Author

Yeah, from everything I can find, it looks like atos is unfortunately the only viable option, apart from resorting to using private frameworks, which would mean AppStore distribution wouldn't be possible. In that case I think we should at least have a Druntime runtime option to disable atos in order to avoid performance hazards.

Another possible improvement would be to cache the results of atos so that at least repeating call stacks would be fast.

@JohanEngelen
Copy link
Member

However, the x86-64 build is only showing a single frame, regardless of -run - @JohanEngelen, can you try what happens for you with -march=x86-64?

Yeah I see the same single frame issue here.

@jacob-carlborg
Copy link
Contributor

As long as macOS has used DWARF the linker has stripped all debug info. You can see my comment how this works on macOS: https://github.com/dlang/dmd/blob/605fb8bf5b522f197798ab4a5197883a7fb0c964/compiler/src/dmd/backend/dwarfdbginf.d#L16-L27

@jacob-carlborg
Copy link
Contributor

Yeah, from everything I can find, it looks like atos is unfortunately the only viable option

We can implement whatever atos is doing ourselves. We're reverse engineering everything else, I don't see why the line is drawn right here at atos.

@JohanEngelen
Copy link
Member

Yeah, from everything I can find, it looks like atos is unfortunately the only viable option

We can implement whatever atos is doing ourselves. We're reverse engineering everything else, I don't see why the line is drawn right here at atos.

Can you be more specific on "everything else"? I am personally quite opposed to implementing (and maintaining!!!!!!) things ourselves that already have implementations elsewhere. I feel that there is already too much work for our available resources.

@jacob-carlborg
Copy link
Contributor

Can you be more specific on "everything else"?

  • Compiler
  • Runtime
  • Standard library
  • Build tool
  • Librarian (or whatever it's called, ar)
  • Linker (if anyone is still maintaining the OMF linker)

@JohanEngelen
Copy link
Member

Can you be more specific on "everything else"?

  • Compiler
  • Runtime
  • Standard library
  • Build tool
  • Librarian (or whatever it's called, ar)
  • Linker (if anyone is still maintaining the OMF linker)

:) when I see this list, I see a huge confirmation of my point that it is very bad idea to implement things ourselves. Build tool, librarian and linker, are all very bad when compared to alternatives. Compiler, runtime, and stdlib cannot be avoided when making a new language, and all are sub-par perhaps partly due to the effort spent on the rest... Even for the compiler, "outsourcing" codegen to LLVM (or GCC) is a huge win in quality, performance, and platform support.

Back on topic: implementing atos ourselves did not happen for the >2 years that symbolization was badly broken on macOS with vanilla LLVM (e.g. homebrew ldc), see https://issues.dlang.org/show_bug.cgi?id=20510. I don't think this is a viable solution path.
For faster symbolization / exception work, we can make external tools (file:line lookup) optional. We can also give an option to select another symbolization program (I read that llvm_symbolizer is much faster than atos).

@jacob-carlborg
Copy link
Contributor

I read that llvm_symbolizer is much faster than atos

Well there you go 😉. Seems like llvm-symbolizer is available as a library as well, in true LLVM fashion [1]. LDC can use that at least, DMD won't.

[1] https://llvm.org/doxygen/classllvm_1_1symbolize_1_1LLVMSymbolizer.html

@s-ludwig
Copy link
Contributor Author

s-ludwig commented Apr 25, 2025

Just noticed another issue - the atos processes get left behind as zombie processes and eventually the process will run out of available file descriptors (after around 2400 exception backtraces, further exceptions print "some pipe creation error!").

@kinke
Copy link
Member

kinke commented Apr 26, 2025

One thing I could easily do is restoring the -preserve-dwarf-line-section workaround, by re-hacking the upcoming LLVM 20 binaries used for the official LDC binaries from GitHub.

However, the x86-64 build is only showing a single frame

Is that still the case with v1.41.0-beta1 and its kept frame pointers?

@s-ludwig
Copy link
Contributor Author

-march x86-64 does seem to work as expected with v1.41.0-beta1 now!

@kinke
Copy link
Member

kinke commented Apr 28, 2025

Okay as expected, and if you had tested -link-defaultlib-debug as suggested earlier, we should have known in advance. ;)

@s-ludwig
Copy link
Contributor Author

@kinke I actually did and it didn't make a difference for x86-64. But can you explain why the stack walking the top frames would be affected by DRuntime code when that only occurs in the lower stack frames?

@kinke
Copy link
Member

kinke commented Apr 28, 2025

I actually did and it didn't make a difference for x86-64

That doesn't make a lot of sense - with -frame-pointer=all for the new code? Anyway, doesn't matter anymore.

But can you explain why the stack walking the top frames

I think it starts from the top (innermost) frames, which are in druntime (_d_throw_exception and friends), so if that goes wrong already, it can't make any progress.

@s-ludwig
Copy link
Contributor Author

Hm, of course, I completely forgot about the exception related code that precedes the actual stack walking invocation.

I don't think I used -frame-pointer=all and -link-defaultlib-debug together, though, so that might have worked indeed.

@LunaTheFoxgirl
Copy link
Contributor

LunaTheFoxgirl commented May 10, 2025

This can hardly be the general case, as macOS CI includes backtrace tests and is happy on both x86_64 and arm64. So at least some aspects of this seem specific to your setup.

Are you talking about unoptimized -g code, or optimized builds incl. -O? In the latter case, the missing frame pointers might be a problem (not just in your code, but druntime/Phobos too, unless using -link-defaultlib-debug). But those are going to be kept by default for Apple targets with the next version (#4889).

The atos-based implementation has been upstreamed, but DMD's druntime most likely doesn't use it because it probably still prevents the linker from stripping the __debug_line section (which we did before LDC v1.31 by hacking LLVM, before the atos-based implementation was merged in #4291).

Are you talking about distributed binaries, where e.g. the .dSYM files might be missing? [Note that I don't have any Apple stuff and cannot test anything myself.] Ref: #4006

With everything said, the current atos implementation should probably be replaced with one that calls libatos.dylib directly instead of piping it to the program.

The API interface is neatly documented in /System/Library/PrivateFrameworks/CoreSymbolication.framework/Versions/A/Resources/libatos/libatos.py

I may make a PR with this change if wanted. llvm-symbolizer and co. calls into the same core systems as libatos, so anything else is really just extra layers of indirection.

In this case, I feel atos should only be linked in in debug builds, as yeah, in release builds you can't rely on private frameworks; and if you target iOS you don't have symbolication anyways. (symbolication in that case is handled on the xcode instance)

@kinke
Copy link
Member

kinke commented May 10, 2025

I may make a PR with this change if wanted.

Well that sounds great to me! :)

In this case, I feel atos should only be linked in in debug builds

Assuming you only need few symbols from libatos, I guess trying to load it dynamically (once) and looking up the symbols via dlsym would do nicely. IIRC, that's how Microsoft's dbghelp.dll is used on Windows.

Edit: Some Windows refs: https://github.com/dlang/dmd/blob/master/druntime/src/core/sys/windows/dbghelp.d, https://github.com/dlang/dmd/blob/8d90d93c00f3bca8589b84c750ac85b2d87ad1a5/druntime/src/core/sys/windows/stacktrace.d#L358-L397

@LunaTheFoxgirl
Copy link
Contributor

I may make a PR with this change if wanted.

Well that sounds great to me! :)

In this case, I feel atos should only be linked in in debug builds

Assuming you only need few symbols from libatos, I guess trying to load it dynamically (once) and looking up the symbols via dlsym would do nicely. IIRC, that's how Microsoft's dbghelp.dll is used on Windows.

Edit: Some Windows refs: https://github.com/dlang/dmd/blob/master/druntime/src/core/sys/windows/dbghelp.d, https://github.com/dlang/dmd/blob/8d90d93c00f3bca8589b84c750ac85b2d87ad1a5/druntime/src/core/sys/windows/stacktrace.d#L358-L397

Yeah, more or less what I plan to do.

@LunaTheFoxgirl
Copy link
Contributor

I did some further investigation, having good exception stack traces is not a usecase apple supports and the atos behaviour is not portable and not supported by apple.

I've made a PR that instead provides a hook so that a dub package can provide the atos hack without making it difficult to get apps approved for AppStore release.
#4930

@kinke
Copy link
Member

kinke commented May 15, 2025

Lol, Apple is really hilarious, worse than Microsoft ever was.

@JohanEngelen
Copy link
Member

I did some further investigation, having good exception stack traces is not a usecase apple supports and the atos behaviour is not portable and not supported by apple.

Got a reference?

@LunaTheFoxgirl
Copy link
Contributor

LunaTheFoxgirl commented May 15, 2025

I did some further investigation, having good exception stack traces is not a usecase apple supports and the atos behaviour is not portable and not supported by apple.

Got a reference?

It's more implicit than any specific statement and extrapolation; for example Atos is backed by a PrivateFramework, Apple only has documentation for using atos after the fact and libatos while having symbols also relies on the DebugSymbols private framework that you are not allowed to use. Anything in PrivateFrameworks is a blanket no-go for production software that's being published on the AppStore.

See: https://developer.apple.com/app-store/review/guidelines/#software-requirements

While I can't show specific examples (NDA), Apple does thoroughly scan your application for disallowed symbols. If your app links to a private framework or does something it's not supposed to using tools inside of private frameworks; you are considered to be using non-public API. As such it's safest for publishing mobile apps with D to not have any direct references to atos in druntime.

Down the line adding a naïve dSYM lookup routine to load DWARF info from them should be possible; but in this case this is just something Apple doesn't really support and adding such a feature would only provide partial coverage.

@s-ludwig
Copy link
Contributor Author

So it sounds like the -preserve-dwarf-line-section approach is the best we can do for release binaries? Apart from retroactive symbolification, of course (which doesn't really work out well for stack traces).

Lol, Apple is really hilarious, worse than Microsoft ever was.

They had a good starting point with the BSD base, but that and the C based CoreFramework APIs, are basically the only stable part you can really rely on. Everything else is in a constant flux of deprecation, often without replacement, and usually under-documented, both things that you can't blame Microsoft for. Unfortunately, they are too popular to ignore. (To be fair, Android is in a similar category)

@kinke
Copy link
Member

kinke commented May 16, 2025

So it sounds like the -preserve-dwarf-line-section approach is the best we can do for release binaries?

FWIW, I've re-applied the hack in our LLVM 20 branch, and CI-tested it temporarily. If I haven't made any mistake (so really tested it, not atos accidentally), then there were no issues - especially no #3864 linker warnings/errors anymore, which were another reason for the atos approach.

@jacob-carlborg
Copy link
Contributor

Is it possible to store a copy of the debug info in a custom section? Would that help? Or does that require a patched LLVM as well?

@kinke
Copy link
Member

kinke commented May 16, 2025

That would surely require another LLVM hack, a more intrusive one.

@LunaTheFoxgirl
Copy link
Contributor

LunaTheFoxgirl commented May 16, 2025

That would surely require another LLVM hack, a more intrusive one.

Could possibly be the realm of an external D tool to find the generated dSYM, then embedding them after the fact into the mach-o object.

Such a tool wouldn't need to care about whether or not it uses private APIs.

If wanted I can look into this as a possibility @jacob-carlborg @s-ludwig

@jacob-carlborg
Copy link
Contributor

jacob-carlborg commented May 16, 2025

Yeah, fine by me. Or directly in the compiler? Or from the object files, then a dSYM doesn't need to be generated.

@LunaTheFoxgirl
Copy link
Contributor

LunaTheFoxgirl commented May 16, 2025

Yeah, fine by me. Or directly in the compiler? Or from the object files, then a dSYM doesn't need to be generated.

I think going by dSYM might be preferred since then D libraries symbols can also be embedded. And also reduces the hackiness needed directly in the compiler. The less hacks that need to be maintained in a fork of LLVM the better, I think.

@jacob-carlborg
Copy link
Contributor

It doesn't need to be in LLVM, in can be in LDC, just as it can be in a separate tool.

@LunaTheFoxgirl
Copy link
Contributor

It doesn't need to be in LLVM, in can be in LDC, just as it can be in a separate tool.

Yeah that could work. Mainly just focusing first on getting the ugly hacks removed that risk breaking production software first. Easier to take a wholistic look at it after that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants