|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "crates.io incident report for 2020-02-20" |
| 4 | +author: Pietro Albini |
| 5 | +team: the crates.io team <https://www.rust-lang.org/governance/teams/crates-io> |
| 6 | +--- |
| 7 | + |
| 8 | +On 2020-02-20 at 21:28 UTC we received a report from an user of crates.io that |
| 9 | +their crate was not available on the index even after 10 minutes since the |
| 10 | +upload. This was a bug in the crates.io webapp exposed by a GitHub outage. |
| 11 | + |
| 12 | +## Root cause of the outage |
| 13 | + |
| 14 | +In some corner cases the code that uploads new commits to the GitHub repository |
| 15 | +of the index was returning a successful status even though the push itself |
| 16 | +failed. The bug caused the job scheduler to think the upload was actually |
| 17 | +successful, causing the job to be removed from the queue and producing a data |
| 18 | +loss. |
| 19 | + |
| 20 | +The outage was caused by that bug, triggered by an unexpected response during |
| 21 | +[the GitHub outage][gh-outage] happening at the same time. |
| 22 | + |
| 23 | +[gh-outage]: https://www.githubstatus.com/incidents/bd29l6zgr43g |
| 24 | + |
| 25 | +## Resolution |
| 26 | + |
| 27 | +The team analyzed the code of the background job uploading commits to the |
| 28 | +index, and found a possible cause of the misreported success. A team member |
| 29 | +[wrote the fix][fix], another one reviewed it and we then deployed the patch |
| 30 | +directly to production. |
| 31 | + |
| 32 | +At the same time, once we saw the index started to be updated again, we removed |
| 33 | +the broken entries in the database manually and asked the reporter to upload |
| 34 | +their crates again. |
| 35 | + |
| 36 | +[fix]: https://github.com/rust-lang/crates.io/pull/2207 |
| 37 | + |
| 38 | +## Affected crates |
| 39 | + |
| 40 | +- [`kaze`](https://crates.io/crates/kaze) 0.1.6 |
| 41 | +- [`wasmer-runtime-core`](https://crates.io/crates/wasmer-runtime-core) 0.14.0 |
| 42 | +- [`wasmer-win-exception-handler`](https://crates.io/crates/wasmer-win-exception-handler) 0.14.0 |
| 43 | + |
| 44 | +## Postmortem |
| 45 | + |
| 46 | +Deploying the change took way longer than expected: there were changes landed |
| 47 | +in master but waiting to be deployed on production, increasing the length of |
| 48 | +the build process and the risks of the deploy. In the future we should deploy |
| 49 | +hotfixes by branching off the current deployed commit, and cherry-picking the |
| 50 | +fix on top of that. We should also strive to reduce the amount of time PRs sit |
| 51 | +in master without being live. |
| 52 | + |
| 53 | +Nobody was paged due to this incident, as our monitoring and alerting system |
| 54 | +wasn’t able to catch the problem: we have monitoring in place for jobs failing |
| 55 | +to execute, but in this case the job was mistakenly marked as correct. We |
| 56 | +should implement periodic checks that ensure the database and the index are |
| 57 | +correctly synchronized. |
| 58 | + |
| 59 | +We were lucky that two members of the team with access to both the support |
| 60 | +email and the production environment were online during the outage: without |
| 61 | +paging available we could’ve noticed it way later than we did. |
| 62 | + |
| 63 | +During the incident investigation we also found that our logging was not good |
| 64 | +enough to properly diagnose the problem: there is no message logged when a |
| 65 | +commit is pushed to the index, nor when a background job is executed. Also, the |
| 66 | +API call to publish new crates doesn’t include the crate name in its line. We |
| 67 | +should enhance our logging capabilities to find the root cause of issues |
| 68 | +quickly during future incidents. |
| 69 | + |
| 70 | +## Timeline of events |
| 71 | + |
| 72 | +It took 1 hour and 31 minutes from the start of the incident to the deploy of |
| 73 | +the fix. |
| 74 | + |
| 75 | +### 2020-02-20 |
| 76 | + |
| 77 | +- **21:17 UTC: the authors of `kaze`, `wasmer-runtime-core` |
| 78 | + and `wasmer-win-exception-handler` published them on crates.io** |
| 79 | +- 21:28 UTC: the author of `wasmer-runtime-core` and |
| 80 | + `wasmer-win-exception-handler` reports the issue to [email protected] |
| 81 | +- **21:31 UTC: GitHub updates their status page to report an outage** |
| 82 | +- 21:33 UTC: Pietro notices the support mail, pings Sean on Discord, Sean |
| 83 | + starts investigating |
| 84 | +- 21:35 UTC: Pietro got back to the author saying that the team was |
| 85 | + investigating |
| 86 | +- 21:37 UTC: Sean and Pietro find the symptoms of the incident |
| 87 | +- 21:50 UTC: Sean finds a possible cause for the bug |
| 88 | +- 22:01 UTC: Sean deletes the affected versions from the database |
| 89 | +- 22:09 UTC: Sean opens up [PR 2207][fix] with the fix |
| 90 | +- **22:16 UTC: GitHub updates their status page to say the problem is fixed** |
| 91 | +- 22:17 UTC: Pietro asks for changes on the PR |
| 92 | +- 22:20 UTC: Sean addresses Pietro’s concerns in the PR |
| 93 | +- 22:23 UTC: PR merged, Sean deploys it directly to master |
| 94 | +- **22:48 UTC: Fix deployed on production** |
| 95 | + |
| 96 | +### 2020-02-21 |
| 97 | + |
| 98 | +- 09:27 UTC: the author of `kaze` reports their crate was affected to |
| 99 | + |
| 100 | +- 12:55 UTC: Pietro deletes the affected version of `kaze` from the database |
| 101 | + and gets back to the author of the crate |
| 102 | +- 14:10 UTC: Pietro analyzes the crates.io database and finds out no other |
| 103 | + crate was affected |
| 104 | + |
| 105 | +## Action items |
| 106 | + |
| 107 | +- [#2226]: Add simple logging when we start the index publishing process. |
| 108 | +- [#2227]: Add a periodic job that checks the index and the database for |
| 109 | + consistency, paging the on-call person if there are any mismatches. The job |
| 110 | + will need to account for the crates not yet published on the index but in the |
| 111 | + queue. |
| 112 | +- [#2228]: Include the crate name in the HTTP log entry for the publish API |
| 113 | + call. |
| 114 | +- [#2229]: Add in-depth logging for swirl background jobs, with information |
| 115 | + such as the job name or the parameters. |
| 116 | +- [#2230]: Investigate whether we want to implement a self-healing feature to |
| 117 | + automatically synchronize the index in cases of mismatch. |
| 118 | + |
| 119 | +[#2226]: https://github.com/rust-lang/crates.io/issues/2226 |
| 120 | +[#2227]: https://github.com/rust-lang/crates.io/issues/2227 |
| 121 | +[#2228]: https://github.com/rust-lang/crates.io/issues/2228 |
| 122 | +[#2229]: https://github.com/rust-lang/crates.io/issues/2229 |
| 123 | +[#2230]: https://github.com/rust-lang/crates.io/issues/2230 |
0 commit comments