Refactor metrics to handle multiple subnetworks #936
Force-pushed from 46974cc to f20500e
portalnet/src/overlay.rs (Outdated):

```diff
@@ -482,11 +479,13 @@ where
     ) -> anyhow::Result<()> {
         match self.validator.validate_content(content_key, content).await {
             Ok(_) => {
-                self.metrics.report_validation(true);
+                self.metrics
+                    .report_validation(&self.protocol.to_string(), true);
```
My update to the metrics strategy requires the addition of a `protocol` argument to the metrics reporting functions. I don't love this. I tried a couple of ideas to eliminate this design, but I couldn't find something satisfactorily simple. This PR is already somewhat complex, and those ideas would have increased the complexity just to avoid adding a `protocol` argument.

I opted to leave the simple design for now. But I'd be curious to hear thoughts from others: is it worth trying to refactor the `protocol` argument away? If requested, though, I'd prefer to address those changes in a following PR.
Added `StorageMetricsReporter` & `OverlayMetricsReporter` to handle the per-subnetwork metrics reporting, avoiding the need to use an explicit `self.protocol` throughout `OverlayService` & `Storage`.
Force-pushed from cf3eae6 to 421f121
Looks like this needs a rebase, so I'll just start with a high-level review.
Since I suggested a high-level change, and there's a pending refactor, I'll just pause my review for now and resume when you're ready.
portalnet/src/config.rs (Outdated):

```rust
            internal_ip: false,
            no_stun: false,
            node_addr_cache_capacity: NODE_ADDR_CACHE_CAPACITY,
            metrics: PORTALNET_METRICS.overlay.clone(),
```
I don't understand why the metrics are inside the config here. It doesn't really seem to belong with the other configured portal network values (simple entries from the CLI). I think the whole PR gets simpler, and the code makes more sense, if the config is left alone and the metrics are just handled separately.
Yeah, good call. That design was cruft left over from a previous design where I manually created the metrics registry and passed it through the configs to the respective services. Now, using `lazy_static` to globally initialize eliminates the need for this.
Force-pushed from e592a0a to 03d41eb
I didn't get through everything, but figured I would send the partial review for you to chew on. I'll wrap it up tomorrow.
src/lib.rs (Outdated):

```rust
    // Initialize prometheus metrics
    if let Some(addr) = trin_config.enable_metrics_with_url {
        prometheus_exporter::start(addr)?;
    }
```
The move of the exporter up here used to be required because we were passing it into the config, but we're not doing that anymore. So I think the code move is unnecessary, right?
If it is necessary, a comment explaining why is warranted. If not, nit to reduce code turnover a bit by moving the block back into place.
portalnet/src/discovery.rs (Outdated):

```rust
use std::hash::{Hash, Hasher};
use std::net::Ipv4Addr;
use std::str::FromStr;
use std::{convert::TryFrom, fmt, fs, io, net::SocketAddr, sync::Arc};
```
I wonder if there's a clippy warning or something we can add to just have this done once in the code base and then not have to trickle it in over time. (I'm definitely happier with this python-like grouping of library imports at the top: system, 3rd-party, and local)
Are you doing this manually, or using some tool to reorder?
rust-lang/style-team#131 (comment)
https://rust-lang.github.io/rustfmt/?version=v1.4.38&search=#group_imports
I've just been doing this manually (following the python-like grouping) whenever I'm working on a file that I notice doesn't conform. Agree that this isn't ideal, but it also looks like the `rustfmt` lint is only available on nightly.

What do you think? I could go through the codebase and update locally using the `rustfmt` lint. But I'm not sure we have a solid way forward to prevent deterioration over time.
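For reference, the nightly-only option from that rustfmt link can be pinned in a `rustfmt.toml` at the repo root (this assumes running `cargo +nightly fmt`):

```toml
# rustfmt.toml — `group_imports` is nightly-only as of this writing
group_imports = "StdExternalCrate"  # std first, then external crates, then crate-local
```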
Yeah, I think the status quo is fine for now.
```rust
// todo: automatically update datasource/dashboard via `create-dashboard` command
// rather than deleting and recreating them
```
To issue?
Well... imo, this is a very low priority issue. Given the increased activity of our stale issue bot, I don't think it will be addressed anytime soon; it will simply be marked as stale before anybody fixes it.

I updated the docs (in `book/src/users/monitoring.md`), so there shouldn't be any surprises for users wanting to update their dashboard. I'm leaning towards leaving this `todo` here & not opening an issue, simply as a reference for any future devs who might be working on improving our grafana workflow, whenever that time may come around.

What do you think?
Eh, I guess I'm still not fully sold on the stale bot. The idea that we are hesitant to create issues that aren't going to be quickly resolved feels like an anti-pattern to me. I don't really see why TODOs peppered throughout the code are any better than a backlog of issues that have been around for a while (I'd argue it's worse, in fact, partially because it's hard to have a conversation in a code TODO). But the solution I'm leaning toward would handle the TODOs in code anyway: #932. So I'm fine leaving it for now.
portalnet/src/metrics/overlay.rs (Outdated):

```rust
    pub fn new(registry: &Registry) -> anyhow::Result<Self> {
        let message_count = register_int_counter_vec_with_registry!(
            opts!(
                "trin_message_count",
```
`total` is the preferred suffix for a generic aggregating counter, according to this Prometheus naming "Best Practices" page: https://prometheus.io/docs/practices/naming/
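As an illustration of that convention (the metric name, labels, and values here are hypothetical, not Trin's actual output), a `_total`-suffixed counter distinguished by a per-subnetwork label would be exposed as:

```
# HELP trin_message_total Total number of portal network messages
# TYPE trin_message_total counter
trin_message_total{protocol="history",direction="inbound"} 42
trin_message_total{protocol="state",direction="inbound"} 7
```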
Ohh, very nice! I hadn't seen that. It's so helpful when you don't have to worry about naming conventions, somebody's already done the thinking for you. I updated a lot of the metrics labels & method names here to correspond more accurately with the best practices.
```rust
        message_count,
        utp_outcome_count,
        utp_active_count,
        utp_active_gauge,
```
👍🏻 for the name improvement
portalnet/src/metrics/overlay.rs (Outdated):

```rust
impl OverlayMetricsReporter {
    pub fn message_total(&self, direction: &str, message_type: &str) {
        let protocol: &str = &self.protocol.to_string();
```
`protocol` is already a `String`, so it seems like `to_string()` is doing an extra memory allocation with no benefit (and the same pattern is repeated below).
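A minimal sketch of the point (the struct, field, and method names are illustrative): since the field is already a `String`, borrowing it as `&str` avoids the fresh allocation that `to_string()` would make.

```rust
struct Reporter {
    protocol: String,
}

impl Reporter {
    fn message_total(&self, direction: &str, message_type: &str) -> String {
        // `&self.protocol` coerces to `&str` via `Deref<Target = str>`;
        // unlike `self.protocol.to_string()`, this makes no new allocation.
        let protocol: &str = &self.protocol;
        format!("{protocol}/{direction}/{message_type}")
    }
}
```

Calling `.to_string()` on a `String` round-trips through `Display` and copies the whole buffer, so the borrow is strictly cheaper when the callee only needs a `&str`.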
With the metric renames, presumably `./ethportal-api/src/dashboard/collected-metrics-dashboard.json.template` has gone stale. It's a little clunky to update, but better than forcing everyone to figure out how to update their dashboard manually, especially since you have the best idea of which metrics changed to which names. Would you double-check that it's updated?
Ok, I feel a little uneasy that I might have missed something subtle in the refactor, but other than that and having the dashboard get created correctly, it LGTM!
👍🏻 for a little more Prometheus naming consistency too.
```rust
impl StorageMetrics {
    pub fn new(protocol: &ProtocolId) -> Self {
```
For future PRs: it's much harder to successfully spot problems while reviewing a PR that does a significant refactor and a functionality change at the same time. So it's really nice to split out a refactor into a separate PR that's quick and easy: no change but a shift of everything from one file to another, then a second PR that really highlights what changes were made for the new functionality.
Force-pushed from 28c59ae to b3d4c35
Force-pushed from b3d4c35 to 65ea371
What was wrong?
Our current metrics registration setup is incompatible with running multiple subnetworks. Basically, each subnetwork tries to re-register an equivalently-named metric, which causes an error.
The way we get around this in our current `StorageMetrics` is to append the subnetwork name to the metric name (e.g. `trin_radius_ratio_History`). But this isn't ideal: a label is much better suited to distinguishing between the same metric for different subnetworks. Furthermore, using a label makes creating interesting / useful dashboards in grafana much easier.

How was it fixed?
- Added `PortalnetMetrics`, which houses clonable references to both `StorageMetrics` & `OverlayMetrics`, which can be used by individual subnetworks
- Moved `PortalnetConfig` from `portalnet/src/types/messages.rs` -> `portalnet/src/config.rs`

To-Do