
Add memory usage monitor callback #21245


Open · wants to merge 20 commits into master

Conversation

DimiChatzipavlis

We made the memory usage monitor callback (CPU/GPU monitoring) according to the devs' instructions in issue #21150 (TensorBoard integration, support for all backends). We look forward to your feedback!
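For reviewers who want a quick feel for the intended API, here is a minimal, hypothetical usage sketch; the import path and constructor arguments (monitor_gpu, log_every_batch, tensorboard_log_dir) are assumptions based on the docstring excerpts quoted later in this thread, not confirmed public API:

import numpy as np
import keras
from keras.callbacks import MemoryUsageCallback  # import path assumed for this PR

# Toy data and model, purely for illustration.
x = np.random.rand(256, 32).astype("float32")
y = np.random.randint(0, 10, size=(256,))

model = keras.Sequential(
    [
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ]
)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Log CPU/accelerator memory at epoch boundaries and write scalars to TensorBoard.
memory_cb = MemoryUsageCallback(
    monitor_gpu=True,
    log_every_batch=False,
    tensorboard_log_dir="./logs/memory",  # hypothetical argument name
)
model.fit(x, y, epochs=2, batch_size=32, callbacks=[memory_cb])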


google-cla bot commented May 3, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@codecov-commenter

codecov-commenter commented May 3, 2025

Codecov Report

Attention: Patch coverage is 60.41667% with 76 lines in your changes missing coverage. Please review.

Project coverage is 79.60%. Comparing base (6b74cb0) to head (8f37649).
Report is 40 commits behind head on master.

Files with missing lines | Patch % | Lines
keras/src/callbacks/memory_usage_callback.py | 60.31% | 67 Missing and 8 partials ⚠️
keras/api/_tf_keras/keras/callbacks/__init__.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21245      +/-   ##
==========================================
+ Coverage   76.84%   79.60%   +2.75%     
==========================================
  Files         565      566       +1     
  Lines       54799    55265     +466     
  Branches     8509     8603      +94     
==========================================
+ Hits        42112    43992    +1880     
+ Misses      10543     9233    -1310     
+ Partials     2144     2040     -104     
Flag | Coverage | Δ
keras | 79.41% <58.85%> | (+2.72%) ⬆️
keras-jax | 63.47% <54.16%> | (-0.09%) ⬇️
keras-numpy | 58.53% <17.18%> | (-0.18%) ⬇️
keras-openvino | ? |
keras-tensorflow | 63.88% <56.25%> | (?)
keras-torch | 63.50% <53.12%> | (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.


@DimiChatzipavlis DimiChatzipavlis changed the title from "Add memory uage monitor callback" to "Add memory usage monitor callback" on May 3, 2025
@fchollet
Collaborator

fchollet commented May 5, 2025

Thanks for the PR! Can you link to a Colab showing the callback in action (maybe with different backends)?

@DimiChatzipavlis
Author

Hi everyone!
I’ve put together a Colab demo showing the MemoryUsageCallback in action (including a TensorBoard integration):
https://colab.research.google.com/drive/1-vV1D98TtGN5A9Cx37aW_7qE-CtoFBfd?usp=sharing

My callback works for CPU as well as the TensorFlow, PyTorch, and JAX backends, and it even writes scalars to TensorBoard. However, OpenVINO doesn't expose a memory-stats API (and isn't typically used for training workloads), so it isn't strictly required here, and its tests keep failing in CI. Does anyone have suggestions for:

  • Skipping or mocking out the OpenVINO memory tests?

  • Alternatively, adding minimal OpenVINO support (even just a warning) so that the import_test passes without installation?

Thanks in advance; I'm eager for your feedback on both the Colab and the OpenVINO tests.
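One common option, sketched here as a suggestion rather than the PR's actual approach, is to skip the accelerator-memory tests whenever the active backend exposes no memory-stats API; the marker and test names below are hypothetical:

import pytest

from keras.src import backend

# Skip accelerator-memory assertions on backends without a memory-stats API.
requires_memory_stats = pytest.mark.skipif(
    backend.backend() == "openvino",
    reason="OpenVINO exposes no memory-stats API; only CPU RSS is logged.",
)


@requires_memory_stats
def test_gpu_memory_is_logged():
    ...  # hypothetical test body exercising MemoryUsageCallback on GPU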

@fchollet fchollet added the keras-team-review-pending Pending review by a Keras team member. label May 27, 2025
@divyashreepathihalli
Collaborator

Thank you for the PR @DimiChatzipavlis - taking a look!

via `tf.summary` (TensorBoard).

Args:
monitor_gpu (bool): If True, attempt to measure accelerator memory.

can you add automatic detection instead of this arg?
if running_on_gpu:
..
if running_on_tpu:
...

logic to add these detections

import keras


def running_on_tpu():
    """Best-effort check for a TPU device under the current Keras backend."""
    backend = keras.config.backend()
    if backend == "jax":
        import jax

        devices = jax.devices()
        return any(d.platform == "tpu" for d in devices)
    elif backend == "tensorflow":
        import tensorflow as tf

        return bool(tf.config.list_logical_devices("TPU"))
    elif backend == "torch":
        # PyTorch/XLA TPU support is out of scope here.
        return False
    return False


def running_on_gpu():
    """Best-effort check for a GPU device under the current Keras backend."""
    backend = keras.config.backend()
    if backend == "jax":
        import jax

        devices = jax.devices()
        return any(d.platform == "gpu" for d in devices)
    elif backend == "tensorflow":
        import tensorflow as tf

        return bool(tf.config.list_logical_devices("GPU"))
    elif backend == "torch":
        import torch

        return torch.cuda.is_available()
    return False
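For illustration, the callback's constructor could then drop the monitor_gpu argument in favor of these helpers; the attribute names below are assumptions, not the PR's actual code:

class MemoryUsageCallback(keras.callbacks.Callback):
    def __init__(self, log_every_batch=False):
        super().__init__()
        # Auto-detect accelerators instead of taking a monitor_gpu flag.
        self._monitor_gpu = running_on_gpu()
        self._monitor_tpu = running_on_tpu()
        self._log_every_batch = log_every_batch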


Args:
monitor_gpu (bool): If True, attempt to measure accelerator memory.
log_every_batch (bool): If True, also log after each batch.

What is the default behavior? Log at the end of each epoch? Please document the default behavior in the docstring.
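For example, the docstring fragment quoted above could spell out the default along these lines (hypothetical wording):

    Args:
        monitor_gpu (bool): If True, attempt to measure accelerator memory.
        log_every_batch (bool): If True, also log after each batch.
            Defaults to False, in which case memory is logged only at the
            start and end of each epoch.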

if psutil is None:
    raise ImportError(
        "MemoryUsageCallback requires the 'psutil' library. "
        "Install via `pip install psutil`."

NIT : "To install please use pip install psutil"
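For reference, the usual optional-dependency pattern looks roughly like this (a sketch; the helper name _require_psutil is made up for illustration):

try:
    import psutil
except ImportError:
    psutil = None  # optional dependency; checked when the callback is built


def _require_psutil():
    if psutil is None:
        raise ImportError(
            "MemoryUsageCallback requires the `psutil` library. "
            "To install, please use `pip install psutil`."
        )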

self._log_epoch("start", epoch)

def on_epoch_end(self, epoch, logs=None):
self._log_epoch("end", epoch, offset=1)

From the Colab output I am observing that the epoch end is not logged when log_every_batch is False.

def _get_cpu_memory(self):
    return self._proc.memory_info().rss / (1024**2)

def _get_gpu_memory(self):

Another function to get TPU memory would be needed as well.
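A rough sketch of what such a helper could look like, assuming JAX's Device.memory_stats() and TensorFlow's tf.config.experimental.get_memory_info("TPU:0") are available on the runtime in question (neither is guaranteed on every TPU setup):

def _get_tpu_memory(self):
    """Best-effort TPU memory usage in MiB; returns None if unavailable."""
    backend = keras.config.backend()
    if backend == "jax":
        import jax

        tpu_devices = [d for d in jax.devices() if d.platform == "tpu"]
        if not tpu_devices:
            return None
        total = 0
        for d in tpu_devices:
            stats = d.memory_stats() or {}  # may be None on some runtimes
            total += stats.get("bytes_in_use", 0)
        return total / (1024**2)
    elif backend == "tensorflow":
        import tensorflow as tf

        if not tf.config.list_logical_devices("TPU"):
            return None
        info = tf.config.experimental.get_memory_info("TPU:0")
        return info["current"] / (1024**2)
    return None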

@divyashreepathihalli divyashreepathihalli (Collaborator) left a comment

Thanks for the PR!
I tried the Colab on an A100 GPU with monitor_gpu set to True, but the logs did not show GPU memory usage.

@google-ml-butler google-ml-butler bot added kokoro:force-run ready to pull Ready to be merged into the codebase labels May 29, 2025
@divyashreepathihalli divyashreepathihalli (Collaborator) left a comment

Left some comments!

@fchollet
Collaborator

Please make sure to insert line breaks in log messages so that the logs do not interfere too much with the progress bar printouts.

@google-ml-butler google-ml-butler bot removed the ready to pull Ready to be merged into the codebase label May 31, 2025
@DimiChatzipavlis
Author

Hi all, thanks for the feedback! I've merged in:
  • Cleaner epoch/batch prints with leading newlines
  • Minor docstring fixes (including pip install hints)
  • Code changes for better functionality

Colab Link: https://colab.research.google.com/drive/1-vV1D98TtGN5A9Cx37aW_7qE-CtoFBfd?usp=sharing

I’m still seeing CI failures because openvino isn’t installed (API gen and integration tests try to import it). Any advice on conditionally skipping or wrapping OpenVINO so tests pass would be hugely appreciated!

Thanks in advance; I look forward to your feedback!

@divyashreepathihalli
Collaborator

Thank you for the updates!
Looks like GPU is working but not TPU.

@DimiChatzipavlis
Author

Hi everyone! I've added two small tweaks:
  • For TensorFlow TPUs, running_on_tpu() now calls TPUClusterResolver + initialize_tpu_system() before checking list_logical_devices("TPU"), so Colab's TPU actually comes up.
  • In _get_tpu_memory(), we fall back to summing bytes_in_use for JAX TPUs.

These changes should finally let the callback detect TPUs in a Colab TPU runtime. Unfortunately, I haven't been able to verify this end-to-end in Colab's TPU runtime (the TPU device list always comes back empty, for both me and my colleague, perhaps due to a shortage of TPU resources in Colab), so any tips on a working TPU setup would be appreciated.

Thanks in advance; I look forward to your feedback!
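For reference, the TensorFlow side of that change presumably looks something like the sketch below (function name and error handling are assumptions, not the exact diff):

def _maybe_init_tf_tpu():
    """Try to bring up the TPU system in a TF Colab runtime; return True if a TPU is visible."""
    import tensorflow as tf

    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
    except (ValueError, tf.errors.NotFoundError):
        return False  # no TPU reachable in this runtime
    return bool(tf.config.list_logical_devices("TPU"))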

@laxmareddyp
Collaborator

Hi @DimiChatzipavlis ,

I cloned your repository and ran the notebook, but noticed that the GPU memory allocation reported during each epoch is much lower than expected. For example, with a batch size of 64 and an image size of 224x224, the first layer should output a tensor of shape 64x224x224x32, which requires about 411 MB (calculated as 4×64×224×224×32 bytes). However, the callback only reports 60 MiB, which is far below the actual memory needed for training, even without rematerialization. This suggests the reported values do not accurately reflect true GPU memory consumption.
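As a quick back-of-the-envelope check on that figure (not part of the PR):

# float32 activations of the first layer: batch 64, 224x224 spatial, 32 channels.
bytes_needed = 4 * 64 * 224 * 224 * 32
print(bytes_needed / 1e6)    # ~411.0 MB
print(bytes_needed / 2**20)  # ~392.0 MiB, still far above the reported 60 MiB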

Refer to this gist: rematerizalization-with-callback

Labels: awaiting review, keras-team-review-pending (Pending review by a Keras team member), size:L
7 participants