Refactor metrics to handle multiple subnetworks #936
Force-pushed from 46974cc to f20500e
portalnet/src/overlay.rs (Outdated):

```diff
@@ -482,11 +479,13 @@ where
     ) -> anyhow::Result<()> {
         match self.validator.validate_content(content_key, content).await {
             Ok(_) => {
-                self.metrics.report_validation(true);
+                self.metrics
+                    .report_validation(&self.protocol.to_string(), true);
```
My update to the metrics strategy requires the addition of a `protocol` argument to the metrics reporting functions. I don't love this. I tried a couple of ideas to eliminate this design, but I couldn't find something satisfactorily simple. This PR is already somewhat complex, and those ideas would have increased the complexity just to avoid adding a `protocol` argument.

I opted to leave the simple design for now. But I'd be curious to hear thoughts from others: is it worth trying to refactor the `protocol` argument away? If requested, though, I'd prefer to address those changes in a following PR.
Added `StorageMetricsReporter` & `OverlayMetricsReporter` to handle the per-subnetwork metrics reporting, avoiding the need to use an explicit `self.protocol` throughout `OverlayService` & `Storage`.
Force-pushed from cf3eae6 to 421f121
Looks like this needs a rebase, so I'll just start with a high-level review.
Since I suggested a high-level change, and there's a pending refactor, I'll just pause my review for now and resume when you're ready.
portalnet/src/config.rs (Outdated):

```rust
            internal_ip: false,
            no_stun: false,
            node_addr_cache_capacity: NODE_ADDR_CACHE_CAPACITY,
            metrics: PORTALNET_METRICS.overlay.clone(),
```
I don't understand why the metrics are inside the config here. It doesn't really seem to belong with the other configured portal network values (simple entries from the CLI). I think the whole PR gets simpler, and the code makes more sense, if the config is left alone and the metrics are just handled separately.
Yeah, good call. That design was cruft left over from a previous design where I manually created the metrics registry and passed it through the configs to the respective services. Now, using `lazy_static` to globally initialize eliminates the need for this.
Force-pushed from e592a0a to 03d41eb
I didn't get through everything, but figured I would send the partial review for you to chew on. I'll wrap it up tomorrow.
src/lib.rs (Outdated):

```rust
    // Initialize prometheus metrics
    if let Some(addr) = trin_config.enable_metrics_with_url {
        prometheus_exporter::start(addr)?;
    }
```
The move of the exporter up here used to be required because we were passing it into the config, but we're not doing that anymore. So I think the code move is unnecessary, right?
If it is necessary, a comment explaining why is warranted. If not, nit to reduce code turnover a bit by moving the block back into place.
portalnet/src/discovery.rs (Outdated):

```rust
use std::hash::{Hash, Hasher};
use std::net::Ipv4Addr;
use std::str::FromStr;
use std::{convert::TryFrom, fmt, fs, io, net::SocketAddr, sync::Arc};
```
I wonder if there's a clippy warning or something we can add to just have this done once in the code base and then not have to trickle it in over time. (I'm definitely happier with this python-like grouping of library imports at the top: system, 3rd-party, and local)
Are you doing this manually, or using some tool to reorder?
rust-lang/style-team#131 (comment)
https://rust-lang.github.io/rustfmt/?version=v1.4.38&search=#group_imports
I've just been doing this manually (following the python-like grouping) whenever I'm working on a file that I notice doesn't conform. Agree that this isn't ideal, but it also looks like the `rustfmt` lint is only available on nightly.

What do you think? I could go through the codebase and update locally using the `rustfmt` lint. But I'm not sure we have a solid way forward to prevent deterioration over time.
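For reference, the nightly-only option from that rustfmt link can be pinned in a `rustfmt.toml` at the repo root (this assumes running `cargo +nightly fmt`):

```toml
# rustfmt.toml — `group_imports` is nightly-only as of this writing
group_imports = "StdExternalCrate"  # std first, then external crates, then crate-local
```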
Yeah, I think the status quo is fine for now.
```rust
// todo: automatically update datasource/dashboard via `create-dashboard` command
// rather than deleting and recreating them
```
To issue?
Well... imo, this is a very low priority issue. Given the increased activity of our stale issue bot, I don't think it will be addressed anytime soon; it will simply be marked as stale before anybody fixes it.

I updated the docs (in `book/src/users/monitoring.md`), so there shouldn't be any surprises for users wanting to update their dashboard. I'm leaning towards leaving this `todo` here & not opening an issue, simply as a reference for any future devs who might be working on improving our grafana workflow, whenever that time may come around.

What do you think?
Eh, I guess I'm still not fully sold on the stale bot. The idea that we are hesitant to create issues that aren't going to be quickly resolved feels like an anti-pattern to me. I don't really see why TODOs peppered throughout the code are any better than a backlog of issues that have been around for a while (I'd argue it's worse, in fact, partially because it's hard to have a conversation in a code TODO). But the solution I'm leaning toward would handle the TODOs in code anyway: #932. So I'm fine leaving it for now.
portalnet/src/metrics/overlay.rs (Outdated):

```rust
    pub fn new(registry: &Registry) -> anyhow::Result<Self> {
        let message_count = register_int_counter_vec_with_registry!(
            opts!(
                "trin_message_count",
```
`total` is the preferred suffix for a generic aggregating counter, according to this Prometheus naming "Best Practices" page: https://prometheus.io/docs/practices/naming/
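As an illustration of that convention (the metric name, labels, and values here are hypothetical, not Trin's actual output), a `_total`-suffixed counter distinguished by a per-subnetwork label would be exposed as:

```
# HELP trin_message_total Total number of portal network messages
# TYPE trin_message_total counter
trin_message_total{protocol="history",direction="inbound"} 42
trin_message_total{protocol="state",direction="inbound"} 7
```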
Ohh, very nice! I hadn't seen that. It's so helpful when you don't have to worry about naming conventions, somebody's already done the thinking for you. I updated a lot of the metrics labels & method names here to correspond more accurately with the best practices.
```rust
        message_count,
        utp_outcome_count,
        utp_active_count,
        utp_active_gauge,
```
👍🏻 for the name improvement
portalnet/src/metrics/overlay.rs (Outdated):

```rust
impl OverlayMetricsReporter {
    pub fn message_total(&self, direction: &str, message_type: &str) {
        let protocol: &str = &self.protocol.to_string();
```
`protocol` is already a `String`, so it seems like `to_string()` is doing an extra memory allocation with no benefit (and the same pattern is repeated below).
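A minimal sketch of the point (the struct, field, and method names are illustrative): since the field is already a `String`, borrowing it as `&str` avoids the fresh allocation that `to_string()` would make.

```rust
struct Reporter {
    protocol: String,
}

impl Reporter {
    fn message_total(&self, direction: &str, message_type: &str) -> String {
        // `&self.protocol` coerces to `&str` via `Deref<Target = str>`;
        // unlike `self.protocol.to_string()`, this makes no new allocation.
        let protocol: &str = &self.protocol;
        format!("{protocol}/{direction}/{message_type}")
    }
}
```

Calling `.to_string()` on a `String` round-trips through `Display` and copies the whole buffer, so the borrow is strictly cheaper when the callee only needs a `&str`.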
With the metric renames, presumably `./ethportal-api/src/dashboard/collected-metrics-dashboard.json.template` has gone stale. It's a little clunky to update, but better than forcing everyone to figure out how to update their dashboard manually, especially since you have the best idea of which metrics changed to which names. Would you double-check that it's updated?
Ok, I feel a little uneasy that I might have missed something subtle in the refactor, but other than that and having the dashboard get created correctly, it LGTM!
👍🏻 for a little more Prometheus naming consistency too.
```rust
impl StorageMetrics {
    pub fn new(protocol: &ProtocolId) -> Self {
```
For future PRs: it's much harder to successfully spot problems while reviewing a PR that does a significant refactor and a functionality change at the same time. So it's really nice to split out a refactor into a separate PR that's quick and easy: no change but a shift of everything from one file to another, then a second PR that really highlights what changes were made for the new functionality.
Force-pushed from 28c59ae to b3d4c35
Force-pushed from b3d4c35 to 65ea371
What was wrong?
Our current metrics registration setup is incompatible with running multiple subnetworks. Basically, each subnetwork tries to re-register an equivalently-named metric, which causes an error.
The way we get around this in our current `StorageMetrics` is to append the subnetwork name to the metric name (e.g. `trin_radius_ratio_History`). But this isn't ideal: a label is much better suited to distinguishing between the same metric for different subnetworks. Furthermore, using a label makes creating interesting / useful dashboards in grafana much easier.

How was it fixed?
- Added `PortalnetMetrics`, which houses clonable references to both `StorageMetrics` & `OverlayMetrics`, which can be used by individual subnetworks
- Moved `PortalnetConfig` from `portalnet/src/types/messages.rs` -> `portalnet/src/config.rs`

To-Do