Add memory usage monitor callback #21245
base: master
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## master #21245 +/- ##
==========================================
+ Coverage 76.84% 79.60% +2.75%
==========================================
Files 565 566 +1
Lines 54799 55265 +466
Branches 8509 8603 +94
==========================================
+ Hits 42112 43992 +1880
+ Misses 10543 9233 -1310
+ Partials 2144 2040 -104
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Thanks for the PR! Can you link to a Colab showing the callback in action (maybe with different backends)?
Hi everyone! My callback works for CPU, TensorFlow, PyTorch, and JAX, and it even writes scalars to TensorBoard, but OpenVINO doesn't expose any memory-stats API (and isn't typically used for training workloads), so it isn't strictly required here and its tests keep failing in CI. Does anyone have suggestions for handling this?
Thanks in advance; I'm eager for your feedback on both the Colab and the OpenVINO tests.
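One common way to handle this in CI is sketched below, assuming the tests use pytest and the standard Keras 3 backend query; the test name and body are hypothetical and not this PR's actual test code.

    # Hypothetical sketch: skip memory-callback tests on the OpenVINO backend,
    # which exposes no memory-stats API. Assumes pytest and Keras 3.
    import pytest

    import keras

    requires_memory_stats = pytest.mark.skipif(
        keras.config.backend() == "openvino",
        reason="OpenVINO exposes no memory-stats API",
    )


    @requires_memory_stats
    def test_memory_usage_callback_smoke():
        # Placeholder body; the real test would exercise the callback.
        psutil = pytest.importorskip("psutil")
        assert psutil is not None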
Thank you for the PR @DimiChatzipavlis - taking a look!
    via `tf.summary` (TensorBoard).

    Args:
        monitor_gpu (bool): If True, attempt to measure accelerator memory.
Can you add automatic detection instead of this arg?

    if running_on_gpu():
        ...
    if running_on_tpu():
        ...

Logic to add these detections:

    import keras


    def running_on_tpu():
        backend = keras.config.backend()
        if backend == "jax":
            import jax

            devices = jax.devices()
            return any(d.platform == "tpu" for d in devices)
        elif backend == "tensorflow":
            import tensorflow as tf

            return bool(tf.config.list_logical_devices("TPU"))
        elif backend == "torch":
            return False
        return False


    def running_on_gpu():
        backend = keras.config.backend()
        if backend == "jax":
            import jax

            devices = jax.devices()
            return any(d.platform == "gpu" for d in devices)
        elif backend == "tensorflow":
            import tensorflow as tf

            return bool(tf.config.list_logical_devices("GPU"))
        elif backend == "torch":
            import torch

            return torch.cuda.is_available()
        return False
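If the helpers above were adopted, the constructor could detect accelerators itself instead of taking `monitor_gpu`. The snippet below is only a sketch of that idea; the attribute names are illustrative.

    # Illustrative sketch only: wiring the detection helpers defined above
    # into the callback so the `monitor_gpu` argument is no longer needed.
    import keras


    class MemoryUsageCallback(keras.callbacks.Callback):
        def __init__(self, log_every_batch=False):
            super().__init__()
            self.log_every_batch = log_every_batch
            # Detect accelerators once, instead of relying on a user flag.
            self._monitor_gpu = running_on_gpu()
            self._monitor_tpu = running_on_tpu()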
    Args:
        monitor_gpu (bool): If True, attempt to measure accelerator memory.
        log_every_batch (bool): If True, also log after each batch.
What is the default behavior? Log at the end of each epoch? You should document the default behavior in the docstring.
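For example, the docstring could spell out the default explicitly; the wording below is only a suggestion and assumes `log_every_batch` defaults to False.

    # Suggested docstring wording (assumes `log_every_batch` defaults to False).
    import keras


    class MemoryUsageCallback(keras.callbacks.Callback):
        """Monitors CPU and accelerator memory during training.

        Args:
            monitor_gpu (bool): If True, attempt to measure accelerator memory.
            log_every_batch (bool): If True, also log after each batch.
                Defaults to False, in which case memory is logged only at the
                start and end of each epoch.
        """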
        if psutil is None:
            raise ImportError(
                "MemoryUsageCallback requires the 'psutil' library. "
                "Install via `pip install psutil`."
NIT: "To install, please use `pip install psutil`."
        self._log_epoch("start", epoch)

    def on_epoch_end(self, epoch, logs=None):
        self._log_epoch("end", epoch, offset=1)
From the Colab output, I am observing that the epoch end is not logged when log_every_batch is False.
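If the cause is a `log_every_batch` guard applied too broadly, one possible fix is to gate only the batch hook. The sketch below is illustrative, not the PR's actual code; `_log_batch` is a hypothetical helper name.

    # Illustrative fix: only per-batch logging is optional; epoch boundaries
    # always log. `_log_batch` is a hypothetical helper.
    import keras


    class MemoryUsageCallback(keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # Always log at epoch end, regardless of `log_every_batch`.
            self._log_epoch("end", epoch, offset=1)

        def on_train_batch_end(self, batch, logs=None):
            # Gate only the per-batch logging on the flag.
            if self.log_every_batch:
                self._log_batch(batch)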
    def _get_cpu_memory(self):
        return self._proc.memory_info().rss / (1024**2)

    def _get_gpu_memory(self):
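The body of `_get_gpu_memory` is not shown above. Purely as an illustration (not the PR's actual code), a multi-backend version might query each framework's own memory API and report MiB, matching `_get_cpu_memory`:

    # Illustrative multi-backend GPU memory query in MiB; not the PR's code.
    def _get_gpu_memory(self):
        import keras

        backend = keras.config.backend()
        if backend == "tensorflow":
            import tensorflow as tf

            if tf.config.list_logical_devices("GPU"):
                info = tf.config.experimental.get_memory_info("GPU:0")
                return info["current"] / (1024**2)
        elif backend == "torch":
            import torch

            if torch.cuda.is_available():
                return torch.cuda.memory_allocated() / (1024**2)
        elif backend == "jax":
            import jax

            gpus = [d for d in jax.devices() if d.platform == "gpu"]
            if gpus:
                stats = gpus[0].memory_stats() or {}
                return stats.get("bytes_in_use", 0) / (1024**2)
        return None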
Another function to get TPU memory would be needed as well.
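A hedged sketch of such a helper follows, assuming the JAX backend and `Device.memory_stats()` (which may be unavailable on some platforms); the name `_get_tpu_memory` is hypothetical.

    # Hypothetical TPU counterpart to `_get_gpu_memory`, in MiB. Only the JAX
    # backend is handled here; `memory_stats()` may return None on some devices.
    def _get_tpu_memory(self):
        import keras

        if keras.config.backend() != "jax":
            return None
        import jax

        tpus = [d for d in jax.devices() if d.platform == "tpu"]
        if not tpus:
            return None
        stats = tpus[0].memory_stats() or {}
        bytes_in_use = stats.get("bytes_in_use")
        return None if bytes_in_use is None else bytes_in_use / (1024**2)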
Left some comments!
Please make sure to insert line breaks in log messages so that the logs do not interfere too much with the progress bar printouts.
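For instance (the message format is illustrative only), prefixing each printed message with a newline keeps it off the progress-bar line:

    # Illustrative: a leading "\n" keeps the message from being appended to
    # the Keras progress bar line.
    def _print_log(message):
        print(f"\n{message}", flush=True)


    _print_log("Epoch 1 start - CPU Memory: 123.45 MB")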
Hi all, thanks for the feedback! I've merged in the requested changes. Colab link: https://colab.research.google.com/drive/1-vV1D98TtGN5A9Cx37aW_7qE-CtoFBfd?usp=sharing I'm still seeing CI failures because openvino isn't installed (API generation and the integration tests try to import it). Any advice on conditionally skipping or wrapping OpenVINO so the tests pass would be hugely appreciated! Thanks in advance; I look forward to your feedback!
Hi everyone! I've added two small tweaks to running_on_tpu(). These changes should finally let the callback detect TPUs in a Colab TPU runtime. Unfortunately, I haven't been able to verify end-to-end in Colab's TPU runtime (TPUs always report empty for both me and my colleague, perhaps due to a shortage of TPU resources in Colab), so any tips on a working TPU setup would be appreciated. Thanks in advance; I look forward to your feedback!
Hi @DimiChatzipavlis, I cloned your repository and ran the notebook, but noticed the GPU memory allocation reported during each epoch is much lower than expected. For example, with a batch size of 64 and image size 224x224, the first layer should output a tensor of shape 64x224x224x32, which requires about 411 MB (calculated as 4×64×224×224×32). However, the callback only reports 60 MiB, which is far below the actual memory needed for training, even without rematerialization. This suggests the reported values do not accurately reflect true GPU memory consumption. Refer to this gist: rematerizalization-with-callback
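For reference, a quick check of the figures quoted above (float32 activations of shape 64x224x224x32):

    # Back-of-the-envelope check of the activation size quoted above.
    bytes_needed = 4 * 64 * 224 * 224 * 32  # float32 * batch * H * W * channels
    print(bytes_needed / 1e6)         # ~411.0 MB (decimal megabytes)
    print(bytes_needed / (1024 ** 2)) # ~392.0 MiB, still far above the reported 60 MiB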
We implemented the memory monitor callback (CPU/GPU monitoring) according to the developers' instructions (Issue #21150: TensorBoard integration, support for all backends). We look forward to your feedback!