Facia Level Metrics & Alerts

A large number of ContentApi requests are made in order to produce each front's pressed.v2.json file. Due to this dependency, the Presser is the point in the architecture that has the most potential for failure. The metrics and alerts relate principally to its ongoing ability to produce pressed.v2.json files.

Some of the metrics are designed to identify terminally unhealthy processes. Others are informational and indicate problems in the broader network (notably ContentApi). Others monitor the speed at which updates travel through the architecture.

Successive press failures - MANUAL presser (facia-tool)

  • Detail : immediately after a collection is edited within the Fronts Editor, the Presser is invoked to re-create the pressed.v2.json files for any front containing that collection.

  • Metrics : if the number of consecutive failures of the Presser to produce pressed.v2.json files exceeds N, the healthcheck fails. The failure count is reset by any successful pressing.

  • Consequence : Instance terminates.
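The consecutive-failure rule above can be sketched as a small counter. This is a minimal illustration only, not the Presser's actual code: the class name, method names, and the way the threshold N is passed in are all hypothetical.

```python
class PressHealthcheck:
    """Hypothetical sketch: unhealthy once consecutive failures exceed N."""

    def __init__(self, max_consecutive_failures: int) -> None:
        # "N" in the text above; the real value is not stated in this document.
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record_success(self) -> None:
        # Any successful pressing resets the failure count.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    def is_healthy(self) -> bool:
        # The healthcheck fails only when the count *exceeds* N.
        return self.consecutive_failures <= self.max_consecutive_failures
```

Because a single success resets the count, this check only trips for a process that fails continuously, which is why it suits detecting terminally unhealthy instances rather than intermittent network problems.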

Successive press failures - CRON presser (facia-press)

  • Detail : every three minutes, the Admin app puts the ids of every front on an SQS queue. Presser instances pull these in batches (of up to 10) every 10 seconds and re-create the pressed.v2.json files for each front.

  • Metrics : if the number of consecutive failures of the Presser to produce pressed.v2.json files exceeds N, the healthcheck fails. The failure count is reset by any successful pressing.

  • Consequence : Instance terminates.
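The enqueue-then-pull cycle described above can be illustrated with an in-memory queue standing in for SQS. This is a sketch under stated assumptions: the function names are hypothetical, no real SQS client is used, and the three-minute and ten-second schedules are left to the caller.

```python
from collections import deque
from typing import Deque, Iterable, List

MAX_BATCH_SIZE = 10  # batch size limit taken from the text above


def enqueue_all_fronts(queue: Deque[str], front_ids: Iterable[str]) -> None:
    """Stand-in for the Admin app's periodic job: queue every front id."""
    queue.extend(front_ids)


def pull_batch(queue: Deque[str], max_batch_size: int = MAX_BATCH_SIZE) -> List[str]:
    """Stand-in for one Presser poll: take up to max_batch_size ids."""
    batch: List[str] = []
    while queue and len(batch) < max_batch_size:
        batch.append(queue.popleft())
    return batch
```

With 25 hypothetical fronts queued, successive polls would return batches of 10, 10, and 5 ids, and then an empty batch until the next enqueue.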

Statistical press failures - CRON presser (facia-press)

  • Detail : due to momentary ContentApi unavailability or increased latency somewhere in the network, pressing will occasionally fail. Above a certain frequency, these failures should be treated as indicative of an operational issue.

  • Metrics : 50 failures within 15 minutes, monitored in CloudWatch.

  • Consequence : PagerDuty alarm
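The "50 failures within 15 minutes" rule can be pictured as a sliding-window count. The sketch below is an in-process illustration only; it is not how CloudWatch evaluates alarms, the class and method names are hypothetical, and treating exactly 50 failures as the trigger point (rather than more than 50) is an assumption.

```python
from collections import deque
from typing import Deque


class FailureWindow:
    """Hypothetical sliding-window counter for press failures."""

    def __init__(self, threshold: int = 50, window_seconds: float = 15 * 60) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def _evict(self, now: float) -> None:
        # Drop failures that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()

    def record_failure(self, now: float) -> None:
        self._timestamps.append(now)
        self._evict(now)

    def should_alarm(self, now: float) -> bool:
        # Assumption: the alarm fires at or above the threshold.
        self._evict(now)
        return len(self._timestamps) >= self.threshold
```

Unlike the consecutive-failure healthcheck, this count is not reset by a success, so it picks up a high failure rate even when some presses still succeed.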

Statistical press failures - MANUAL presser (facia-tool)

  • Detail : due to momentary ContentApi unavailability or increased latency somewhere in the network, pressing will occasionally fail. Above a certain frequency, these failures should be treated as indicative of an operational issue.

  • Metrics : 50 failures within 15 minutes, monitored in CloudWatch.

  • Consequence : PagerDuty alarm

ContentApi invalid responses

  • Detail : the Presser (and also the fronts tools) depends on ContentApi results being well-formed. ContentApi occasionally returns malformed results.

  • Metrics : (a) 50 malformed results within 15 minutes, monitored in CloudWatch; (b) a single malformed result monitored by the facia-tool UI.

  • Consequence : (a) PagerDuty alarm; (b) a "red" alert in the UI with an option to "try again".

On edit, press within 10 seconds (not incl. cache time)

  • Detail : The end-to-end process from invoking the Presser to producing a pressed.v2.json file - after a collection is manually edited - should be under 10 seconds. (It typically takes about 3 seconds under normal network conditions.)

  • Metric : the difference between the most recent timestamp of a front's various collection.json files, and the timestamp of that front's pressed.v2.json file. This is checked 10 seconds after any edit; the result itself should be below 10 seconds.

  • Consequence : if the metric exceeds 10 seconds, an "orange" alert appears in the tool UI with an option to "try again". The condition occasionally occurs due to momentary high latency somewhere in the network, and generally does not recur on retry.
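The timestamp comparison behind this metric can be sketched as follows. This is a minimal illustration, assuming timestamps are available as seconds since the epoch; the function names are hypothetical, and treating a pressed file that is older than the latest edit as late is an assumption rather than documented behaviour.

```python
from typing import Iterable

PRESS_DEADLINE_SECONDS = 10.0  # limit taken from the text above


def press_lag_seconds(collection_timestamps: Iterable[float], pressed_timestamp: float) -> float:
    """Lag between the newest collection.json edit and the pressed.v2.json file."""
    latest_edit = max(collection_timestamps)
    return pressed_timestamp - latest_edit


def press_is_late(
    collection_timestamps: Iterable[float],
    pressed_timestamp: float,
    deadline_seconds: float = PRESS_DEADLINE_SECONDS,
) -> bool:
    lag = press_lag_seconds(collection_timestamps, pressed_timestamp)
    # Assumption: a negative lag means the front has not been re-pressed
    # since the latest edit, so it is also reported as late.
    return lag < 0 or lag > deadline_seconds
```

For example, with collections last edited at 100, 105, and 103 seconds and a pressed file written at 108 seconds, the lag is 3 seconds, which matches the typical figure quoted above and would not raise the "orange" alert.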