fix variance for identical values #554

Draft
wants to merge 1 commit into
base: main

Conversation

bournejt

Mimic the pandas implementation so that the variance is exactly 0 when all values in the window are identical. This avoids critical numerical errors in variance calculations.

@timkpaine timkpaine marked this pull request as draft June 26, 2025 16:57
@timkpaine timkpaine added the "type: enhancement" label (Issues and PRs related to improvements to existing features) Jun 26, 2025
Collaborator

@AdamGlustein AdamGlustein left a comment

Thanks a ton for the contribution! I left some feedback on the implementation; when you get time, let me know what you think.

Also, would you be able to add a test case under test_stats.py that verifies this behavior? You could re-use the same reproduction you had in the Discussion, and just assert the variance/weighted variance is exactly 0.

@@ -435,6 +463,8 @@ class Variance
double m_dx;
double m_count;
int64_t m_ddof;
double m_lastValue;

I would add a comment explaining these variables are for the special case where all values in a window are the same.

m_lastValue = x;
m_consecutiveSameCount = 1;
}
else if( x == m_lastValue )

You can do

if( x == m_lastValue && m_count > 1 )
    ...
else
    ...

Also, a minor style comment: we usually don't include braces for one-line if-statement bodies.

@@ -397,6 +397,21 @@ class Variance
void add( double x )
{
m_count++;
// Track consecutive same values (pandas approach)

I would add to the comment why we track the values (for the case where all values are the same and we want to avoid floating-point errors).

return;
}
// Reset consecutive tracking since we can't maintain it accurately during removal
m_consecutiveSameCount = 0;

It's incorrect to reset the count here. Consider a window [1, 1, 1, 1] with interval=3. At t=4, when the first value is removed, we set m_consecutiveSameCount to zero, even though it should be 3.

We actually don't need to do anything in remove to get this functionality to work, since we are checking m_consecutiveSameCount >= m_count in compute (note the >=). So the only event that needs to reset the consecutive count is the addition of a new value.
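
To make the argument concrete, here is a minimal sketch of the idea, not the actual class. It assumes a Welford-style running update and values leaving the window oldest-first. The names VarianceSketch, rollingVariance, m_mean and m_m2 are hypothetical and chosen only for illustration; m_lastValue, m_consecutiveSameCount, m_count and m_ddof follow the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for the Variance class under review (illustration only).
class VarianceSketch
{
public:
    explicit VarianceSketch( int64_t ddof = 1 ) : m_ddof( ddof ) {}

    void add( double x )
    {
        m_count++;
        // Track the run of identical values at the newest end of the window, so that
        // compute() can return exactly 0 and avoid floating-point residue.
        if( x == m_lastValue && m_count > 1 )
            m_consecutiveSameCount++;
        else
        {
            m_lastValue = x;
            m_consecutiveSameCount = 1;
        }
        // Welford update (assumed here; the real update code is not shown in the diff)
        double delta = x - m_mean;
        m_mean += delta / m_count;
        m_m2 += delta * ( x - m_mean );
    }

    void remove( double x )
    {
        // Note: m_consecutiveSameCount is deliberately left untouched here.
        m_count--;
        if( m_count == 0 )
        {
            m_mean = 0.0;
            m_m2 = 0.0;
            return;
        }
        // Reverse Welford update
        double delta = x - m_mean;
        m_mean -= delta / m_count;
        m_m2 -= delta * ( x - m_mean );
    }

    double compute() const
    {
        if( m_count > m_ddof )
        {
            // The run may be longer than the window after removals, hence >= rather than ==
            if( m_consecutiveSameCount >= m_count )
                return 0.0;
            return m_m2 / ( m_count - m_ddof );
        }
        return std::numeric_limits<double>::quiet_NaN();
    }

private:
    double m_mean = 0.0;
    double m_m2 = 0.0;
    double m_lastValue = 0.0;
    int64_t m_count = 0;
    int64_t m_consecutiveSameCount = 0;
    int64_t m_ddof;
};

// Hypothetical driver: feeds values through a count-based window of the given interval,
// removing the oldest value before adding the newest one.
std::vector<double> rollingVariance( const std::vector<double> & values, size_t interval )
{
    VarianceSketch var;
    std::vector<double> out;
    for( size_t i = 0; i < values.size(); ++i )
    {
        if( i >= interval )
            var.remove( values[i - interval] );
        var.add( values[i] );
        out.push_back( var.compute() );
    }
    return out;
}
```

With rollingVariance( { 1, 1, 1, 1 }, 3 ), the fourth tick removes one value (m_count drops to 2) and adds another (m_count back to 3, run length 4), so the >= check still fires and the result is exactly 0. If remove reset the run length instead, the same tick would fall through to the floating-point path.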

@@ -435,6 +463,8 @@ class Variance
double m_dx;
double m_count;
int64_t m_ddof;
double m_lastValue;
int64_t m_consecutiveSameCount;

nit: can we rename this to m_consecutiveValueCount?

@@ -455,6 +485,22 @@ class WeightedVariance
{
if( w <= 0 )
return;
// Track consecutive same values and observation count

All the same comments apply to WeightedVariance as to Variance.

}

double compute() const
{
if( m_count > m_ddof )
{
// Check if all values are identical (pandas approach)
if( m_count == 1 || m_consecutiveSameCount >= m_count )

No need to check m_count here: if m_count == 1 then it's guaranteed that m_consecutiveSameCount >= 1, and thus the condition will be hit anyway.

@@ -494,6 +554,9 @@ class WeightedVariance
double m_unnormWVar;
double m_dx;
int64_t m_ddof;
int64_t m_count;

We can remove m_count here (see the prior comment about how it's not needed in the if-check in compute).


3 participants