Skip to content

Commit 990972d

Browse files
authored
Merge pull request #521 from pietroalbini/crates-io-postmortem
[inside-rust] publish 2020-02-20 crates.io incident report
2 parents fb7a506 + 6235a0b commit 990972d

File tree

1 file changed

+123
-0
lines changed

1 file changed

+123
-0
lines changed
Lines changed: 123 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,123 @@
1+
---
2+
layout: post
3+
title: "crates.io incident report for 2020-02-20"
4+
author: Pietro Albini
5+
team: the crates.io team <https://www.rust-lang.org/governance/teams/crates-io>
6+
---
7+
8+
On 2020-02-20 at 21:28 UTC we received a report from an user of crates.io that
9+
their crate was not available on the index even after 10 minutes since the
10+
upload. This was a bug in the crates.io webapp exposed by a GitHub outage.
11+
12+
## Root cause of the outage
13+
14+
In some corner cases the code that uploads new commits to the GitHub repository
15+
of the index was returning a successful status even though the push itself
16+
failed. The bug caused the job scheduler to think the upload was actually
17+
successful, causing the job to be removed from the queue and producing a data
18+
loss.
19+
20+
The outage was caused by that bug, triggered by an unexpected response during
21+
[the GitHub outage][gh-outage] happening at the same time.
22+
23+
[gh-outage]: https://www.githubstatus.com/incidents/bd29l6zgr43g
24+
25+
## Resolution
26+
27+
The team analyzed the code of the background job uploading commits to the
28+
index, and found a possible cause of the misreported success. A team member
29+
[wrote the fix][fix], another one reviewed it and we then deployed the patch
30+
directly to production.
31+
32+
At the same time, once we saw the index started to be updated again, we removed
33+
the broken entries in the database manually and asked the reporter to upload
34+
their crates again.
35+
36+
[fix]: https://github.com/rust-lang/crates.io/pull/2207
37+
38+
## Affected crates
39+
40+
- [`kaze`](https://crates.io/crates/kaze) 0.1.6
41+
- [`wasmer-runtime-core`](https://crates.io/crates/wasmer-runtime-core) 0.14.0
42+
- [`wasmer-win-exception-handler`](https://crates.io/crates/wasmer-win-exception-handler) 0.14.0
43+
44+
## Postmortem
45+
46+
Deploying the change took way longer than expected: there were changes landed
47+
in master but waiting to be deployed on production, increasing the length of
48+
the build process and the risks of the deploy. In the future we should deploy
49+
hotfixes by branching off the current deployed commit, and cherry-picking the
50+
fix on top of that. We should also strive to reduce the amount of time PRs sit
51+
in master without being live.
52+
53+
Nobody was paged due to this incident, as our monitoring and alerting system
54+
wasn’t able to catch the problem: we have monitoring in place for jobs failing
55+
to execute, but in this case the job was mistakenly marked as correct. We
56+
should implement periodic checks that ensure the database and the index are
57+
correctly synchronized.
58+
59+
We were lucky that two members of the team with access to both the support
60+
email and the production environment were online during the outage: without
61+
paging available we could’ve noticed it way later than we did.
62+
63+
During the incident investigation we also found that our logging was not good
64+
enough to properly diagnose the problem: there is no message logged when a
65+
commit is pushed to the index, nor when a background job is executed. Also, the
66+
API call to publish new crates doesn’t include the crate name in its line. We
67+
should enhance our logging capabilities to find the root cause of issues
68+
quickly during future incidents.
69+
70+
## Timeline of events
71+
72+
It took 1 hour and 31 minutes from the start of the incident to the deploy of
73+
the fix.
74+
75+
### 2020-02-20
76+
77+
- **21:17 UTC: the authors of `kaze`, `wasmer-runtime-core`
78+
and `wasmer-win-exception-handler` published them on crates.io**
79+
- 21:28 UTC: the author of `wasmer-runtime-core` and
80+
`wasmer-win-exception-handler` reports the issue to [email protected]
81+
- **21:31 UTC: GitHub updates their status page to report an outage**
82+
- 21:33 UTC: Pietro notices the support mail, pings Sean on Discord, Sean
83+
starts investigating
84+
- 21:35 UTC: Pietro got back to the author saying that the team was
85+
investigating
86+
- 21:37 UTC: Sean and Pietro find the symptoms of the incident
87+
- 21:50 UTC: Sean finds a possible cause for the bug
88+
- 22:01 UTC: Sean deletes the affected versions from the database
89+
- 22:09 UTC: Sean opens up [PR 2207][fix] with the fix
90+
- **22:16 UTC: GitHub updates their status page to say the problem is fixed**
91+
- 22:17 UTC: Pietro asks for changes on the PR
92+
- 22:20 UTC: Sean addresses Pietro’s concerns in the PR
93+
- 22:23 UTC: PR merged, Sean deploys it directly to master
94+
- **22:48 UTC: Fix deployed on production**
95+
96+
### 2020-02-21
97+
98+
- 09:27 UTC: the author of `kaze` reports their crate was affected to
99+
100+
- 12:55 UTC: Pietro deletes the affected version of `kaze` from the database
101+
and gets back to the author of the crate
102+
- 14:10 UTC: Pietro analyzes the crates.io database and finds out no other
103+
crate was affected
104+
105+
## Action items
106+
107+
- [#2226]: Add simple logging when we start the index publishing process.
108+
- [#2227]: Add a periodic job that checks the index and the database for
109+
consistency, paging the on-call person if there are any mismatches. The job
110+
will need to account for the crates not yet published on the index but in the
111+
queue.
112+
- [#2228]: Include the crate name in the HTTP log entry for the publish API
113+
call.
114+
- [#2229]: Add in-depth logging for swirl background jobs, with information
115+
such as the job name or the parameters.
116+
- [#2230]: Investigate whether we want to implement a self-healing feature to
117+
automatically synchronize the index in cases of mismatch.
118+
119+
[#2226]: https://github.com/rust-lang/crates.io/issues/2226
120+
[#2227]: https://github.com/rust-lang/crates.io/issues/2227
121+
[#2228]: https://github.com/rust-lang/crates.io/issues/2228
122+
[#2229]: https://github.com/rust-lang/crates.io/issues/2229
123+
[#2230]: https://github.com/rust-lang/crates.io/issues/2230

0 commit comments

Comments
 (0)