Hi, thank you for your wonderful work.
I've noticed that in the paper, the relevance score between image patches and tokens is calculated as:

where the positive values of the gradients are set to 0 through the min function, leaving only the negative values. The paper gives the reason as:
"Inspired by GradCAM, we filter out uninformative attention scores by multiplication with the gradient which could cause an increase in the image-text similarity."
But in your code implementation, clamp(0) is applied to the gradients, which sets the negative values to 0 and keeps the positive ones. Isn't that actually a max function rather than a min?
grads = ( grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24) * mask )
Could anyone provide an explanation? Thanks a lot!
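For anyone who wants to check the clamp behaviour directly, here is a minimal sketch. It assumes a recent PyTorch install, and the gradient values are made up for illustration rather than taken from the repository. It compares `clamp(0)` against an element-wise max and an element-wise min with zero:

```python
import torch

# Hypothetical gradient values, chosen only for illustration.
grads = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# The first positional argument of Tensor.clamp is the lower bound `min`,
# so clamp(0) raises every value below 0 up to 0.
clamped = grads.clamp(0)

zeros = torch.zeros_like(grads)
max_with_zero = torch.maximum(grads, zeros)  # element-wise max(g, 0)
min_with_zero = torch.minimum(grads, zeros)  # element-wise min(g, 0)

same_as_max = torch.equal(clamped, max_with_zero)
same_as_min = torch.equal(clamped, min_with_zero)

print(clamped.tolist())
print("matches max(g, 0):", same_as_max)
print("matches min(g, 0):", same_as_min)
```

On these values, `clamp(0)` matches max(g, 0) and not min(g, 0). The code therefore keeps the positive gradients, whereas the formula as described above keeps the negative ones.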