Description
At Status, we have been fighting resource leaks for months (sockets, file descriptors, futures, timers and, most importantly, memory):
- Half closed vacp2p/nim-libp2p#174
- prevent transport tests from leaking vacp2p/nim-libp2p#134
- Experiment with destructors, use them to cleanup peerinfo leaks vacp2p/nim-libp2p#318
- fix gossipsub memory leak on disconnected peer vacp2p/nim-libp2p#371
- status-im/nim-eth@6ec942a
- Bug/fix transport leaks #37 status-im/nim-eth#39
- Properly fix cancellation race and not introduce FD leaks. status-im/nim-chronos#102
- release callback memory early status-im/nim-chronos#130
- Cleanup deleted timers status-im/nim-chronos#65
- Continuous calls to write() after error may cause memory leaks. status-im/nim-chronos#53
- Sync master with devel status-im/nimbus-eth2#1673
- remove potentially GC leak-inducing acyclicity tag status-im/nimbus-eth2#1011
- memory leak during block syncing status-im/nimbus-eth2#850
- Memory leak during sync status-im/nimbus-eth2#1629
- TransportOsError on `make witti` status-im/nimbus-eth2#1123
- Asynchronous unattended Future[T] tracking. status-im/nimbus-eth2#1121
- Rework pubsub status-im/nimbus-eth2#1474 (comment)
- Memory leak: fix the cyclic-ness of the DAG status-im/nimbus-eth2#1010
- The current beacon_node version has a memory leak status-im/nimbus-eth2#447
- [ongoing] Network stability status-im/nimbus-eth2#784
- suspiciously high number of new futures being created in the local sim status-im/nimbus-eth2#779
- properly close connections vacp2p/nim-libp2p#128
We built several tools to help us track these issues:
- add stream metrics vacp2p/nim-libp2p#136
- More memory and perf profiling vacp2p/nim-libp2p#207
- Initial tracking mechanism. status-im/nim-chronos#33
- RFC: Future[T] leak or unattended Future[T] tracking. status-im/nim-chronos#31
- some metrics for monitoring futures status-im/nim-chronos#85
- RFC: Transports leak tracking. status-im/nim-chronos#32
- Memory-accounting solution status-im/nimbus-eth2#869
- [WIP] Zero-cost unattended Future[T] tracking mechanism. status-im/nim-chronos#106
However, we are still leaking. As a workaround, we currently advise people to restart every 6 hours or so, but for production we need to remove all possible sources of leaks so that users can run the application for months without a restart.
The computational part of our application is relatively easy to debug for leaks, but the async/IO/networking part has been leaking resources quite often via closure iterators/futures due to unattended cancellation/expiration.
This is incredibly hard to debug.
We would like to have `-d:useMalloc` available for the default GC, backported to the 1.2.x branch, so that we can use conventional C tools like Valgrind to detect those memory leaks and also run memory leak detection in dedicated CI.
Furthermore, we are currently investigating an issue with misreported memory accounting by the Nim GC (memory fragmentation?), which makes leaks even harder to debug with Nim's standard tools.