fix variance for identical values #554
Thanks a ton for the contribution! I left some feedback on the implementation; when you get time, let me know what you think.
Also, would you be able to add a test case under test_stats.py that verifies this behavior? You could re-use the same reproduction you had in the Discussion, and just assert the variance/weighted variance is exactly 0.
@@ -435,6 +463,8 @@ class Variance
double m_dx;
double m_count;
int64_t m_ddof;
double m_lastValue;
I would add a comment explaining these variables are for the special case where all values in a window are the same.
m_lastValue = x;
m_consecutiveSameCount = 1;
}
else if( x == m_lastValue )
You can do
if( x == m_lastValue && m_count > 1 )
...
else
...
Also, a minor style comment: we usually don't include braces for one-line if-statement bodies.
@@ -397,6 +397,21 @@ class Variance
void add( double x )
{
m_count++;
// Track consecutive same values (pandas approach)
I would add to the comment why we track the values (for the case all values are the same and we want to avoid floating-point errors).
return;
}
// Reset consecutive tracking since we can't maintain it accurately during removal
m_consecutiveSameCount = 0;
It's incorrect to reset the count here. Consider a window [1, 1, 1, 1] with interval=3: at t=4, when the first value is removed, we are going to set m_consecutiveSameCount to zero, even though it should be 3.
We actually don't need to do anything in remove to get this functionality to work, since we are checking m_consecutiveSameCount >= m_count in compute (note the >=). So the only event that needs to reset the consecutive count is the addition of a new value.
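To make the point about remove concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: the class name RollingVariance is hypothetical, the mean and variance updates use a plain Welford-style add/remove that I am assuming for illustration, and returning NaN when there are too few observations is also an assumption. Only the handling of m_lastValue and m_consecutiveSameCount follows the discussion above.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical, simplified stand-in for the Variance class under review.
class RollingVariance
{
public:
    explicit RollingVariance( int64_t ddof = 1 )
        : m_mean( 0 ), m_unnormVar( 0 ), m_count( 0 ), m_ddof( ddof ),
          m_lastValue( 0 ), m_consecutiveSameCount( 0 ) {}

    void add( double x )
    {
        m_count++;
        // Track the run of identical values ending at the newest value, so that a
        // window holding one repeated value can report exactly 0 instead of
        // accumulated floating-point error.
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveSameCount++;
        else
        {
            m_lastValue = x;
            m_consecutiveSameCount = 1;
        }

        // Welford-style update (assumed here, not taken from the PR).
        double dx = x - m_mean;
        m_mean += dx / m_count;
        m_unnormVar += dx * ( x - m_mean );
    }

    void remove( double x )
    {
        // Deliberately leaves m_consecutiveSameCount untouched: the run length may
        // exceed m_count after a removal, which the >= in compute() accounts for.
        m_count--;
        if( m_count == 0 )
        {
            m_mean = m_unnormVar = 0;
            return;
        }
        double dx = x - m_mean;
        m_mean -= dx / m_count;
        m_unnormVar -= dx * ( x - m_mean );
    }

    double compute() const
    {
        if( m_count > m_ddof )
        {
            // Every value still in the window belongs to the current run.
            if( m_consecutiveSameCount >= m_count )
                return 0.0;
            return m_unnormVar / ( m_count - m_ddof );
        }
        // Assumption: too few observations yields NaN.
        return std::numeric_limits<double>::quiet_NaN();
    }

private:
    double m_mean;
    double m_unnormVar;
    int64_t m_count;
    int64_t m_ddof;
    double m_lastValue;
    int64_t m_consecutiveSameCount;
};
```

With a window of size 3 fed four identical values, the run length is 4 and m_count is 3 after the first removal, so compute returns exactly 0. Once a different value is added, the run resets to 1 and the regular formula is used until the window again holds a single repeated value.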
@@ -435,6 +463,8 @@ class Variance
double m_dx;
double m_count;
int64_t m_ddof;
double m_lastValue;
int64_t m_consecutiveSameCount;
nit: can we rename this to m_consecutiveValueCount?
@@ -455,6 +485,22 @@ class WeightedVariance
{
if( w <= 0 )
return;
// Track consecutive same values and observation count
All the same comments apply to WeightedVariance as to Variance.
}

double compute() const
{
if( m_count > m_ddof )
{
// Check if all values are identical (pandas approach)
if( m_count == 1 || m_consecutiveSameCount >= m_count )
No need to check m_count here: if m_count == 1, then it's guaranteed that m_consecutiveSameCount >= 1, and thus the condition will be hit anyway.
@@ -494,6 +554,9 @@ class WeightedVariance
double m_unnormWVar;
double m_dx;
int64_t m_ddof;
int64_t m_count;
We can remove m_count here (see the prior comment about how it's not needed in the if-check in compute).
Mimic the pandas implementation so that the variance is exactly 0 when all values in the window are identical. This avoids critical numerical errors in variance calculations.