Skip to content

[wip][core] Add tcmalloc and enable on demand memory profile for GCS #38443

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 45 commits into from

Conversation

fishbone
Copy link
Contributor

@fishbone fishbone commented Aug 15, 2023

Why are these changes needed?

This PR compile tcmalloc into ray for two purposes:

  • Make Ray's memory footprint better
  • Enable developers able to do on demand memory profiling.

gRPC endpoint has been added:

grpcurl -use-reflection -plaintext -d '{"dump_prefix": "/tmp/debugxxx", "opts": {"lg_prof_interval": "5"}, "duration": 5}'  localhost:6379 ray.rpc.DebugService/StartMemoryProfile

And then with pprof, we shall get:

pprof /tmp/debugxxx.0002.heap
File: gcs_server
Type: inuse_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 224.45kB, 96.51% of 232.57kB total
Dropped 36 nodes (cum <= 1.16kB)
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
   69.61kB 29.93% 29.93%       81kB 34.83%  opencensus::stats::Delta::Record
   62.68kB 26.95% 56.88%    88.45kB 38.03%  ray::rpc::ServerCallFactoryImpl::CreateCall
   53.10kB 22.83% 79.71%    53.10kB 22.83%  gpr_malloc
   10.12kB  4.35% 84.07%    10.12kB  4.35%  opencensus::stats::MeasureData::MeasureData
    7.50kB  3.22% 87.29%     7.50kB  3.22%  google::protobuf::internal::AllocateMemory
    6.64kB  2.86% 90.15%     6.64kB  2.86%  std::vector::_M_realloc_insert
    5.39kB  2.32% 92.47%     5.39kB  2.32%  grpc_core::Server::RequestRegisteredCall
    4.98kB  2.14% 94.61%     7.31kB  3.14%  ray::rpc::InternalKVGcsService::WithAsyncMethod_InternalKVPut::RequestInternalKVPut
    2.51kB  1.08% 95.68%     2.51kB  1.08%  std::__cxx11::basic_string::_M_construct
    1.91kB  0.82% 96.51%     2.81kB  1.21%  ray::rpc::InternalKVGcsService::WithAsyncMethod_InternalKVGet::RequestInternalKVGet

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone changed the title Leak umm [core] Add tcmalloc and enable on demand memory profile for GCS Aug 16, 2023
@fishbone fishbone marked this pull request as ready for review August 16, 2023 01:56
Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone marked this pull request as draft August 16, 2023 01:57
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Comment on lines -12 to -14
# Set compiler here to build Ray with CLANG/LLVM
ENV CC=clang
ENV CXX=clang++-12
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these flags somehow mess up the build. https://buildkite.com/ray-project/oss-ci-build-pr/builds/32272#018a0079-0255-4c9d-bb9d-095edf048e65

I'm not expert there and it took me a long time to realize it's because of these flags.

Another thing is that we build linux wheel with gcc (manylinux docker), so it seems doesn't make sense using clang to run CI.

It's added by this PR, #31522

I'll see whether it can pass the arm build or not by just removing this. Otherwise, I'll add a flag there to enable it only for arm build.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. if you want to make this change, could you put it in a separate PR? I guess the change is not tightly tied to this change in this PR, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. stil working on this one.

BUILD.bazel Outdated
@@ -8,7 +8,7 @@

load("@rules_proto//proto:defs.bzl", "proto_library")
load("@rules_python//python:defs.bzl", "py_library", "py_runtime", "py_runtime_pair")
load("@rules_cc//cc:defs.bzl", "cc_binary", "cc_library", "cc_proto_library", "cc_test")
load("@rules_cc//cc:defs.bzl", "cc_library", "cc_binary", "cc_proto_library", "cc_test")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why the order of loading needs to change? seems that it was alphabetically ordered before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still iterate on this one because CI failed to compile. I'll put WIP there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bazel format fixed this.

@fishbone fishbone changed the title [core] Add tcmalloc and enable on demand memory profile for GCS [wip][core] Add tcmalloc and enable on demand memory profile for GCS Aug 18, 2023
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
Signed-off-by: Yi Cheng <[email protected]>
@fishbone fishbone closed this Aug 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants