Hi, thanks for the work you have put into this research. I have a few pressing questions that I cannot find answers to, and was hoping somebody could give some insight.
1. What was the positional encoding added to the encoder outputs? My impression is that it was a GSD positional encoding (GSD PE) computed from the target resolution, not the input resolution (e.g., in Fig. 2 of the paper, input_resolution is 0.7 m and target_resolution is 0.3 m; correct me if I am mistaken). However, looking at the code, it seems to be calculated using the input resolution of 0.7 m. I would expect that, for the decoder to produce an accurate reconstruction of the target, the GSD PE should be calculated from the target resolution rather than from the input resolution as is done in the code. Is this the correct implementation?
2. Furthermore, looking into the forward method of the code, specifically at line 362 of scale-mae/mae/models_mae.py on the main branch, I see:
pos_embed = get_2d_sincos_pos_embed_with_resolution( x.shape[-1], target_dim, target_res, cls_token=True, device=x.device )
This pos_embed is then passed through an embedding projection module but never used for anything else. Is this intentional, or am I just unable to find where it is used? This is one of the reasons why I asked question 1: it seems the intention was to compute a GSD PE from the target resolution, but the implementation does not appear to do so.
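To make question 1 concrete, here is a minimal sketch of why the choice of resolution matters. It assumes, as I read the paper, that the GSD PE is a standard sine-cosine encoding whose position is scaled by g/G, where g is the image GSD and G is a reference GSD. The function name gsd_sincos_1d, the reference value of 1.0, and the interleaved sin/cos layout are my own illustrative choices, not the repository's code.

```python
import math


def gsd_sincos_1d(num_positions, dim, gsd, reference_gsd=1.0):
    """1D sine-cosine positional encoding with positions scaled by gsd / reference_gsd.

    Hypothetical helper for illustration only; it is not the repository's
    get_2d_sincos_pos_embed_with_resolution.
    """
    assert dim % 2 == 0, "dim must be even (one sin and one cos per frequency)"
    scale = gsd / reference_gsd  # assumed g / G scaling of the position
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = scale * pos / (10000 ** (2 * i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# Same positions, encoded once with the input GSD and once with the target GSD.
pe_input = gsd_sincos_1d(num_positions=4, dim=8, gsd=0.7)
pe_target = gsd_sincos_1d(num_positions=4, dim=8, gsd=0.3)

# Largest element-wise gap between the two encodings.
max_diff = max(
    abs(a - b)
    for row_in, row_tg in zip(pe_input, pe_target)
    for a, b in zip(row_in, row_tg)
)
print(max_diff)
```

Under these assumptions the two tables agree only at position 0, so a decoder that receives the 0.7 m encoding sees different positional signals than one that receives the 0.3 m encoding.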
More generally, better documentation of the backend code would make it less confusing to navigate. Even after working through the code from the SatMAE paper, which this work builds upon, I felt lost quite a few times. Regardless, thank you for your hard work and dedication in open-sourcing and publishing your research :)