Skip to content

Refactor metrics to handle multiple subnetworks #936

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 5, 2023

Conversation

njgheorghita
Copy link
Collaborator

@njgheorghita njgheorghita commented Sep 27, 2023

What was wrong?

Our current metrics registration setup is incompatible with running multiple subnetworks. Basically, each subnetwork tries to re-register an equivalently-named metric, which causes an error.

The way we get around this in our current StorageMetrics is to append the subnetwork name to the metric name (eg trin_radius_ratio_History). But this isn't ideal, using a label is much better suited to distinguishing between the same metric for different subnetworks. Furthermore using a label makes creating interesting / useful dashboards in grafana much easier.

How was it fixed?

  • Created PortalnetMetrics which houses clonable references to both StorageMetrics & OverlayMetrics, which can be used by individual subnetworks
  • moved PortalnetConfig from portalnet/src/types/messages.rs -> portalnet/src/config.rs

To-Do

  • Add entry to the release notes (may forgo for trivial changes)
  • Clean up commit history
  • Update dashboard json template (will address in following pr)

@njgheorghita njgheorghita self-assigned this Sep 28, 2023
@njgheorghita njgheorghita force-pushed the update-dashboard branch 7 times, most recently from 46974cc to f20500e Compare September 29, 2023 15:06
@njgheorghita njgheorghita changed the title [WIP] Refactor metrics to handle multiple subnetworks Refactor metrics to handle multiple subnetworks Sep 29, 2023
@@ -482,11 +479,13 @@ where
) -> anyhow::Result<()> {
match self.validator.validate_content(content_key, content).await {
Ok(_) => {
self.metrics.report_validation(true);
self.metrics
.report_validation(&self.protocol.to_string(), true);
Copy link
Collaborator Author

@njgheorghita njgheorghita Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My update to the metrics strategy requires the addition of a protocol argument to the metrics reporting functions. I don't love this. I tried a couple ideas to eliminate this design, but I couldn't find something satisfactorily simple. This pr is already somewhat complex, and that would have increased the complexity in exchange for simply adding a protocol argument.

I opted to leave the simple design for now. But, I'd be curious to hear thoughts from others. Is it worth trying to refactor the protocol argument away? Though, if requested, I'd prefer to address those changes for a following pr.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added StorageMetricsReporter & OverlayMetricsReporter to handle the per-subnetwork metrics reporting, avoiding the need to use an explicit self.protocol throughout OverlayService & Storage

@njgheorghita njgheorghita force-pushed the update-dashboard branch 3 times, most recently from cf3eae6 to 421f121 Compare September 29, 2023 15:44
@njgheorghita njgheorghita marked this pull request as ready for review September 29, 2023 15:51
@carver
Copy link
Collaborator

carver commented Oct 2, 2023

Looks like this needs a rebase, so I'll just start with a high-level review

Copy link
Collaborator

@carver carver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I suggested a high-level change, and there's a pending refactor, I'll just pause my review for now and resume when you're ready.

internal_ip: false,
no_stun: false,
node_addr_cache_capacity: NODE_ADDR_CACHE_CAPACITY,
metrics: PORTALNET_METRICS.overlay.clone(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the metrics are inside the config here. It doesn't really seem to belong with the other configured portal network values (simple entries from the CLI). I think the whole PR gets simpler, and the code makes more sense, if the config is left alone and the metrics are just handled separately.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good call. That design was cruft leftover from a previous design where I manually created the metrics registry and passed it through the configs to the respective services. Now using the lazy_static to globally initialize eliminates the need for this.

@njgheorghita njgheorghita force-pushed the update-dashboard branch 4 times, most recently from e592a0a to 03d41eb Compare October 3, 2023 17:35
@njgheorghita njgheorghita requested a review from carver October 3, 2023 19:07
Copy link
Collaborator

@carver carver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get through everything, but figured I would send the partial review for you to chew on. I'll wrap it up tomorrow.

src/lib.rs Outdated
Comment on lines 48 to 51
// Initialize prometheus metrics
if let Some(addr) = trin_config.enable_metrics_with_url {
prometheus_exporter::start(addr)?;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The move of the exporter up here used to be required because we were passing it into the config, but we're not doing that anymore. So I think the code move is unnecessary, right?

If it is necessary, a comment explaining why is warranted. If not, nit to reduce code turnover a bit by moving the block back into place.

Comment on lines 1 to 5
use std::hash::{Hash, Hasher};
use std::net::Ipv4Addr;
use std::str::FromStr;
use std::{convert::TryFrom, fmt, fs, io, net::SocketAddr, sync::Arc};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a clippy warning or something we can add to just have this done once in the code base and then not have to trickle it in over time. (I'm definitely happier with this python-like grouping of library imports at the top: system, 3rd-party, and local)

Are you doing this manually, or using some tool to reorder?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rust-lang/style-team#131 (comment)

https://rust-lang.github.io/rustfmt/?version=v1.4.38&search=#group_imports

I've just been doing this manually (following the python-like grouping) whenever I'm working on a file that I notice doesn't conform. Agree that this isn't ideal, but it also looks like the rustfmt lint is only available on nightly.

What do you think? I could go through the codebase, and update locally using the rustfmt lint. But I'm not sure we have a solid way forward to prevent deterioration over time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the status quo is fine for now.

Comment on lines +15 to +16
// todo: automatically update datasource/dashboard via `create-dashboard` command
// rather than deleting and recreating them
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To issue?

Copy link
Collaborator Author

@njgheorghita njgheorghita Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... Imo, this is a very low priority issue. Given the increased activity of our stale issue bot, I don't think it will be addressed anytime soon and will simply be marked as stale before anybody fixes this.

I updated the docs (in book/src/users/monitoring.md) so there shouldn't be any surprises for users wanting to update their dashboard. I'm leaning towards leaving this todo here & not opening an issue. Simply as a reference for any future devs who might be working on improving our grafana workflow, whenever that time may come around.

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I guess I'm still not fully sold on the stale bot. The idea that we are hesitant to create issues that aren't going to be quickly resolved feels like an anti-pattern to me. I don't really see why TODOs peppered throughout the code is any better than a backlog of issues that have been around for a while (I'd argue it is worse, in fact. Partially because it's hard to have a conversation in a code TODO). But I guess the solution I'm leaning toward would handle the TODOs in code: #932 so I'm fine leaving it for now

pub fn new(registry: &Registry) -> anyhow::Result<Self> {
let message_count = register_int_counter_vec_with_registry!(
opts!(
"trin_message_count",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total is the preferred suffix for a generic aggregating counter, according to this Prometheus naming "Best Practices" page: https://prometheus.io/docs/practices/naming/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, very nice! I hadn't seen that. It's so helpful when you don't have to worry about naming conventions, somebody's already done the thinking for you. I updated a lot of the metrics labels & method names here to correspond more accurately with the best practices.

message_count,
utp_outcome_count,
utp_active_count,
utp_active_gauge,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 for the name improvement


impl OverlayMetricsReporter {
pub fn message_total(&self, direction: &str, message_type: &str) {
let protocol: &str = &self.protocol.to_string();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protocol is already a String, it seems like to_string() is doing an extra memory allocation with no benefit. (and repeated below)

Copy link
Collaborator

@carver carver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the metric renames, presumably ./ethportal-api/src/dashboard/collected-metrics-dashboard.json.template has gone stale. It's a little clunky to update, but better than forcing everyone to figure out how to manually update their dashboard manually, especially since you should have the best idea what metrics changed to what names. Would you double-check that it's updated?

Ok, I feel a little uneasy that I might have missed something subtle in the refactor, but other than that and having the dashboard get created correctly, it LGTM!

👍🏻 for a little more Prometheus naming consistency too.

Comment on lines -763 to -764
impl StorageMetrics {
pub fn new(protocol: &ProtocolId) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future PRs: it's much harder to successfully spot problems while reviewing a PR that does a significant refactor and a functionality change at the same time. So it's really nice to split out a refactor into a separate PR that's quick and easy: no change but a shift of everything from one file to another, then a second PR that really highlights what changes were made for the new functionality.

@njgheorghita njgheorghita force-pushed the update-dashboard branch 3 times, most recently from 28c59ae to b3d4c35 Compare October 5, 2023 17:47
@njgheorghita njgheorghita merged commit f1b9578 into ethereum:master Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants