Skip to content

Commit bab9355

Browse files
authored
Merge pull request #1266 from rylev/historical-data-refactor
Refactor handling of historical data
2 parents f3fc582 + de6f7b0 commit bab9355

File tree

2 files changed

+26
-87
lines changed

2 files changed

+26
-87
lines changed

docs/glossary.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -35,11 +35,10 @@ The following is a glossary of domain specific terminology. Although benchmarks
3535
## Analysis
3636

3737
* **test result delta**: the difference between two test results for the same metric and test case.
38-
* **significant test result delta**: a test result delta that is above some threshold that we have determined to be an actual change in performance and not noise.
39-
* **noisy test case**: a test case for which the non-significant test result deltas are on average above some threshold. Non-noisy data tends to be very close to zero, and so a test case that produces non-significant test result deltas that are not close enough to zero on average are classified as noisy.
38+
* **significance threshold**: the threshold at which a test result delta is considered "significant" (i.e., a real change in performance and not just noise). This is calculated using [the upper IQR fence](https://www.statisticshowto.com/upper-and-lower-fences/#:~:text=Upper%20and%20lower%20fences%20cordon,%E2%80%93%20(1.5%20*%20IQR)) as seen [here](https://github.com/rust-lang/rustc-perf/blob/8ba845644b4cfcffd96b909898d7225931b55557/site/src/comparison.rs#L935-L941).
39+
* **significant test result delta**: a test result delta above the significance threshold. Significant test result deltas can be thought of as "statistically significant".
40+
* **dodgy test case**: a test case for which the significance threshold is significantly large indicating a high amount of variability in the test and thus making it necessary to be somewhat skeptical of any results too close to the significance threshold.
4041
* **relevant test result delta**: a synonym for *significant test result delta* in situations where the term "significant" might be ambiguous and readers may potentially interpret *significant* as "large" or "statistically significant". For example, in try run results, we use the term relevant instead of significant.
41-
* **highly variable test case**: a test case that frequently produces (over some threshold) significant test result deltas. This differs from sensitive benchmarks in that the deltas aren't necessarily large, they just happen more frequently than one would expect, and it is unclear why the significant deltas are happening. Note: highly variable test cases can be classified as noisy. They are different from classically noisy benchmarks in that they often produce deltas that are above the significance threshold. We cannot therefore be 100% if they are variable because they are noisy or because parts of the compiler being exercised by these benchmarks are just being changed very often.
42-
* **highly sensitive benchmark**: a test case that tends to produce large (over some threshold) significant test result deltas. This differs from highly variable test cases in that it is usually clear what type of changes tend to result in significant test result deltas. For example, some of the stress tests are highly sensitive; they produce significant test result deltas when parts of the compiler that they are stressing change and the percentage change is typically quite large.
4342

4443
## Other
4544

site/src/comparison.rs

Lines changed: 23 additions & 83 deletions
Original file line numberDiff line numberDiff line change
@@ -619,7 +619,8 @@ async fn compare_given_commits(
619619
let statistics_for_a = statistics_from_series(&mut responses);
620620
let statistics_for_b = statistics_from_series(&mut responses);
621621

622-
let variances = BenchmarkVariances::calculate(ctxt, a.clone(), master_commits, stat).await?;
622+
let mut historical_data =
623+
HistoricalDataMap::calculate(ctxt, a.clone(), master_commits, stat).await?;
623624
let statistics = statistics_for_a
624625
.into_iter()
625626
.filter_map(|(test_case, a)| {
@@ -629,7 +630,7 @@ async fn compare_given_commits(
629630
benchmark: test_case.0,
630631
profile: test_case.1,
631632
scenario: test_case.2,
632-
variance: variances.data.get(&test_case).cloned(),
633+
historical_data: historical_data.data.remove(&test_case),
633634
results: (a, b),
634635
})
635636
})
@@ -824,14 +825,13 @@ impl Comparison {
824825
}
825826
}
826827

827-
/// A description of the amount of variance a certain benchmark is historically
828-
/// experiencing at a given point in time.
829-
pub struct BenchmarkVariances {
830-
/// Variance data on a per test case basis
831-
pub data: HashMap<(Benchmark, Profile, Scenario), BenchmarkVariance>,
828+
/// The historical data for a certain benchmark
829+
pub struct HistoricalDataMap {
830+
/// Historical data on a per test case basis
831+
pub data: HashMap<(Benchmark, Profile, Scenario), HistoricalData>,
832832
}
833833

834-
impl BenchmarkVariances {
834+
impl HistoricalDataMap {
835835
const NUM_PREVIOUS_COMMITS: usize = 100;
836836
const MIN_PREVIOUS_COMMITS: usize = 50;
837837

@@ -841,18 +841,18 @@ impl BenchmarkVariances {
841841
master_commits: &[collector::MasterCommit],
842842
stat: String,
843843
) -> Result<Self, BoxedError> {
844-
let mut variance_data = HashMap::new();
844+
let mut historical_data = HashMap::new();
845845

846846
let previous_commits = Arc::new(previous_commits(
847847
from,
848848
Self::NUM_PREVIOUS_COMMITS,
849849
master_commits,
850850
));
851851

852-
// Return early if we don't have enough data to calculate variance.
852+
// Return early if we don't have enough data for historical analysis
853853
if previous_commits.len() < Self::MIN_PREVIOUS_COMMITS {
854854
return Ok(Self {
855-
data: variance_data,
855+
data: historical_data,
856856
});
857857
}
858858

@@ -869,37 +869,25 @@ impl BenchmarkVariances {
869869

870870
for _ in previous_commits.iter() {
871871
for (test_case, stat) in statistics_from_series(&mut previous_commit_series) {
872-
variance_data.entry(test_case).or_default().push(stat);
872+
historical_data.entry(test_case).or_default().push(stat);
873873
}
874874
}
875875

876876
// Only retain test cases for which we have enough data to calculate variance.
877-
variance_data.retain(|_, v| v.data.len() >= Self::MIN_PREVIOUS_COMMITS);
878-
879-
for ((bench, _, _), results) in variance_data.iter_mut() {
880-
log::trace!("Calculating variance for: {}", bench);
881-
results.calculate_description();
882-
}
877+
historical_data.retain(|_, v| v.data.len() >= Self::MIN_PREVIOUS_COMMITS);
883878

884879
Ok(Self {
885-
data: variance_data,
880+
data: historical_data,
886881
})
887882
}
888883
}
889884

890885
#[derive(Debug, Default, Clone, Serialize)]
891-
pub struct BenchmarkVariance {
886+
pub struct HistoricalData {
892887
data: Vec<f64>,
893-
description: BenchmarkVarianceDescription,
894888
}
895889

896-
impl BenchmarkVariance {
897-
/// The ratio of change that we consider significant.
898-
const SIGNFICANT_DELTA_THRESHOLD: f64 = 0.01;
899-
/// The percentage of significant changes that we consider too high
900-
const SIGNFICANT_CHANGE_THRESHOLD: f64 = 5.0;
901-
/// The ratio of change that constitutes noisy data
902-
const NOISE_THRESHOLD: f64 = 0.001;
890+
impl HistoricalData {
903891
/// The multiple of the IQR above Q3 that signifies significance
904892
const IQR_MULTIPLIER: f64 = 3.0;
905893

@@ -963,36 +951,6 @@ impl BenchmarkVariance {
963951
(q1, q3)
964952
}
965953

966-
fn calculate_description(&mut self) {
967-
self.description = BenchmarkVarianceDescription::Normal;
968-
969-
let percent_changes = self.percent_changes();
970-
let non_significant = percent_changes
971-
.iter()
972-
.take_while(|&&c| c < Self::SIGNFICANT_DELTA_THRESHOLD)
973-
.collect::<Vec<_>>();
974-
975-
let percent_significant_changes = ((percent_changes.len() - non_significant.len()) as f64
976-
/ percent_changes.len() as f64)
977-
* 100.0;
978-
log::trace!(
979-
"Percent significant changes: {:.1}%",
980-
percent_significant_changes
981-
);
982-
983-
if percent_significant_changes > Self::SIGNFICANT_CHANGE_THRESHOLD {
984-
self.description = BenchmarkVarianceDescription::HighlyVariable;
985-
return;
986-
}
987-
988-
let delta_mean =
989-
non_significant.iter().cloned().sum::<f64>() / (non_significant.len() as f64);
990-
log::trace!("Ratio change: {:.4}", delta_mean);
991-
if delta_mean > Self::NOISE_THRESHOLD {
992-
self.description = BenchmarkVarianceDescription::Noisy;
993-
}
994-
}
995-
996954
// Absolute deltas between adjacent results
997955
fn deltas(&self) -> impl Iterator<Item = f64> + '_ {
998956
self.data
@@ -1008,24 +966,6 @@ impl BenchmarkVariance {
1008966
}
1009967
}
1010968

1011-
#[derive(Debug, Clone, Copy, Serialize)]
1012-
#[serde(tag = "type", content = "percent")]
1013-
pub enum BenchmarkVarianceDescription {
1014-
Normal,
1015-
/// A highly variable benchmark that produces many significant changes.
1016-
/// This might indicate a benchmark which is very sensitive to compiler changes.
1017-
HighlyVariable,
1018-
/// A noisy benchmark which is likely to see changes in performance simply between
1019-
/// compiler runs.
1020-
Noisy,
1021-
}
1022-
1023-
impl Default for BenchmarkVarianceDescription {
1024-
fn default() -> Self {
1025-
Self::Normal
1026-
}
1027-
}
1028-
1029969
/// Gets the previous commit
1030970
pub fn prev_commit<'a>(
1031971
artifact: &ArtifactId,
@@ -1057,14 +997,14 @@ pub struct TestResultComparison {
1057997
benchmark: Benchmark,
1058998
profile: Profile,
1059999
scenario: Scenario,
1060-
variance: Option<BenchmarkVariance>,
1000+
historical_data: Option<HistoricalData>,
10611001
results: (f64, f64),
10621002
}
10631003

10641004
impl TestResultComparison {
10651005
/// The amount of relative change considered significant when
10661006
/// we cannot determine from historical data
1067-
const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD: f64 = 0.002;
1007+
const DEFAULT_SIGNIFICANCE_THRESHOLD: f64 = 0.002;
10681008

10691009
pub fn benchmark(&self) -> Benchmark {
10701010
self.benchmark
@@ -1086,10 +1026,10 @@ impl TestResultComparison {
10861026

10871027
/// Magnitude of change considered significant
10881028
fn significance_threshold(&self) -> f64 {
1089-
self.variance
1029+
self.historical_data
10901030
.as_ref()
1091-
.map(|v| v.significance_threshold())
1092-
.unwrap_or(Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD)
1031+
.map(|d| d.significance_threshold())
1032+
.unwrap_or(Self::DEFAULT_SIGNIFICANCE_THRESHOLD)
10931033
}
10941034

10951035
/// This is a numeric magnitude of a particular change.
@@ -1148,7 +1088,7 @@ impl TestResultComparison {
11481088
}
11491089

11501090
fn is_dodgy(&self) -> bool {
1151-
self.variance
1091+
self.historical_data
11521092
.as_ref()
11531093
.map(|v| v.is_dodgy())
11541094
.unwrap_or(false)
@@ -1527,7 +1467,7 @@ mod tests {
15271467
benchmark: index.to_string().as_str().into(),
15281468
profile: Profile::Check,
15291469
scenario: Scenario::Empty,
1530-
variance: None,
1470+
historical_data: None,
15311471
results: (before, after),
15321472
});
15331473
}

0 commit comments

Comments
 (0)