Skip to content

feat: Add random state feature. #150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 15, 2025

Conversation

john-halloran
Copy link

  • feat: Added random_state feature for reproducibility.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

We have to decide how much testing we will add. Ideal is 100% coverage, optimal is probably less.

Maybe write the docstrings so I can understand what the class does, then we can decide what to test?

@john-halloran
Copy link
Author

This is great!

We have to decide how much testing we will add. Ideal is 100% coverage, optimal is probably less.

Maybe write the docstrings so I can understand what the class does, then we can decide what to test?

Thanks, will work on resolving these. To be clear, for things like the docstrings would you prefer I make new PRs, get those merged, then rebase this one, or just add to this existing PR?

@john-halloran
Copy link
Author

For now, I will assume anything given as feedback in this PR could be in scope to include.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start. I left a couple of comments.

@@ -17,6 +17,64 @@ def __init__(
components=None,
random_state=None,
):
"""Run sNMF based on an ndarray, parameters, and either a number
Copy link
Contributor

@sbillinge sbillinge Jun 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic! Thanks for this. Please see here for our docstring standards, I am not sure if you looked at it:
https://scikit-package.github.io/scikit-package/programming-guides/billinge-group-standards.html#docstrings

For classes it is a bit tricky because what info do we put in the "Class" docstring and what info do we put in the "constructor" (i.e., the __init__()) docstring. After some googling we came up with the breakout that is shown in the DiffractionObjects class that is shown there. We would be after something similar here.

By way of example, I would probably do like this in this case

def SNMFOptimizer:
       '''Configuration and methods to run the stretched NMF algorithm, sNMF

        Instantiating the SNMFOptimizer class runs all the analysis
        immediately. The results can then be accessed as instance attributes
        of the class (X, Y, and A). 

        Please see <reference to paper> for more details about the algorithm.

        Attributes
        -----------
         mm : ndarray
            The array containing the data to be decomposed. Shape is (length_of_signal, number_of_conditions)
        y0 : ndarray
            The array containing initial guesses for the component weights
            at each stretching condition.  Shape is (number_of_components, number_of_conditions
            ...
'''

put future development plans into issues, not in the docstring. Just describe the current behavior. Try and keep it brief but highly informational.

To conform to PEP8 standards I lower-cased the variables. I know they correspond to matrices but we should decide which standard to break. The tie-breaker should probably be scikit-learn. Whatever they do, let's do that. Let's also add a small comment (not in the docstring) to remind ourselves in the future if it breaks PEP8 or it will annoy me every time we revisit it and I will try and change it back......

Conditions on instantiation will go in the constructor docstring.

That one describes the init method so should look more like a function docstring. It would look something like....

    def __init__(mm....)
        '''Initialize a SNMFOptimizer instance and run the optimization
 
       Parameters
       ------------
            mm : ndarray
                The array containing the data to be decomposed. Shape is (length_of_signal, number_of_conditions)
           y0 : ndarray Optional.  Defaults to None
                The array containing initial guesses for the component weights
                at each stretching condition.  Shape is (number_of_components, number_of_conditions
        ...

I think there was some text before about how Y0 was required. But if it is required it may be better to make it a required (positional) variable in the constructor and not have it optional. we can discuss design decisions too if you like.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either Y0 or n_components needs to be provided. Currently, Y0.shape overrides n_components if both are provided, and throws an error if neither are provided. The way scikit-learn is a little more flexible and also allows for an n_components which is different from Y0.shape, although I'm not clear on why you'd want that. But I'm not matching their behavior exactly because the current code doesn't allow that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scikit-learn actually does break PEP8 to upper-case the matrices

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good progress, please see comments.


For more information on sNMF, please reference:
Gu, R., Rakita, Y., Lan, L. et al. Stretched non-negative matrix factorization.
npj Comput Mater 10, 193 (2024). https://doi.org/10.1038/s41524-024-01377-5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would normally do a list of Class attributes here. Everything that is self.something. This is obviously strongly overlapped with the arguments of the constructor, as many of the attributes get defined in the constructor, but logically they are different. Here we list and dsecribe the class attributes, there we describe the init function arguments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on how I'd distinguish the arguments from the attributes. I understand how they are different semantically, but what part of that is necessary to make clear here? Can you give an example? Those have been helpful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything that is self.something (except for methods which are self.functions() which are not considered attributes) is an attribute. So MM, Y0, X0 are attributes, but also M, N, rng, num_updates etc.

Inside a function or method the parameters are the arguments of the function. so for the __init__() function they will be MM, Y0, X0, A, rho, eta and so on). Some of the descriptions will overlap but for the function argument the user needs to know if it is optional or not, what the default is, and anything else they need to know to successfully instantiate the class. People will generally not see the two docstrings at the same time, so there can be some repetition, but try and keep it short but informative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls confirm with a thumbs up if you saw this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added class attributes. Some of them are for sure redundant but a full cleanup I will save for the next PR.

provided.
The array containing initial guesses for the component weights
at each stretching condition. Shape is (number of components, number of
conditions) Must be provided if n_components is not provided. Will override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normally we would raise an exception if two conflicting things are provided (we don't want to guess which is the right one) unless there is a good functional reason to do it another way. We like to avoid "magic" and the current behavior of the code could be "magic". Please raise an exception unless there is a strong reason to do otherwise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I don't see any reason for them not to match, so now the user will only be allowed to provide one. This isn't what scikit-learn does, but per your suggestion it makes the most sense for now. The new logic will be: "first, check that exclusively one of n_components and Y0 is provided. If not, raise an exception. If n_components is provided, use that to generate a Y0 with the appropriate size."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if scikit-learn does it the "magic" way we may want to/have to conform to that. But for now, let's do it this way.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good discussion. pls see my comments


For more information on sNMF, please reference:
Gu, R., Rakita, Y., Lan, L. et al. Stretched non-negative matrix factorization.
npj Comput Mater 10, 193 (2024). https://doi.org/10.1038/s41524-024-01377-5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything that is self.something (except for methods which are self.functions() which are not considered attributes) is an attribute. So MM, Y0, X0 are attributes, but also M, N, rng, num_updates etc.

Inside a function or method the parameters are the arguments of the function. so for the __init__() function they will be MM, Y0, X0, A, rho, eta and so on). Some of the descriptions will overlap but for the function argument the user needs to know if it is optional or not, what the default is, and anything else they need to know to successfully instantiate the class. People will generally not see the two docstrings at the same time, so there can be some repetition, but try and keep it short but informative.

@sbillinge
Copy link
Contributor

Thanks, will work on resolving these. To be clear, for things like the docstrings would you prefer I make new PRs, get those merged, then rebase this one, or just add to this existing PR?

Sounds good. Yes, in general, smaller PRs are easier to merge. We never rebase. You may have been using that term loosely, but it has a particular meaning. But yes everything will get merged together.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to run for a plane. I will make another review later, but this is getting very close now.

be overridden by Y0 if that is provided, but must be provided if no Y0 is
provided.
random_state : int
The seed for the initial matrices used in the optimization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little unclear. What matrices?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be clearer now.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstrings are really good now, modulo a few small comments.

I want to understand why the SNMFOptimizer is a class and not a function (or rather a method in the SNMF class). Let's figure this out. It won't create that much work...less actually because we will need fewer docstrings.....so don't panic about wasted effort.



my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, components=2)
my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, n_components=2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this, it seems that we have already instantiated some kind of SNMF class, then this is doing the optimization. Is there a particular reason why we make this a class and not a function? It feels much more like a function to me. Could you think what the downsides are of making it a function? Is scikit-learn doing some thing like this too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What scikit-learn has is an NMF class which instantiates, and then has the standard set of model class members like fit_transform and such. But to actually "do the math", NMF internally makes a call to non_negative_factorization. So the math should probably be a function, but unless (and until) we are literally merging this into scikit-learn, it's worth having a class. But instantiating the class should not be running math as it does currently. That's going to be the next thing I tackle, but this PR is getting a little long so I'm holding off.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's worth having a class

I don't see the value in having this as a class tbh. Can you justify why it is worth having a class for this (when we have already instantiated a higher lever class (I can't easily see in the browser what this class is called) as snmf_class)

@john-halloran
Copy link
Author

john-halloran commented Jun 15, 2025

I am also taking off for a couple of days but this should be in decent shape. Content wise this is all I'd like to add without opening a new PR, but let me know if you see typos or fixes. Otherwise it looks okay to merge on my end. Thanks for all the help.

@sbillinge
Copy link
Contributor

@john-halloran I figured out what was going. on. there was a dirty import in main (as a general matter we rarely/never use import <modeule> which obscured a bit the actual structure and which tricked me. Please ignore my comment about the class. Let's in general, have a conversation about what the class structure and naming should look like, but there is lots of good work on this PR so I will merge it for now.

@sbillinge sbillinge merged commit 373da11 into diffpy:john-development Jun 15, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants