ADJUSTMENT #10

Open
wants to merge 1 commit into
base: master

Vannghia69
Contributor

Updated:

  • Redefine the "entanglement" function
  • Rewrite the "refine" function
  • Change the name of the new untangle method from "permutations" to "ShUnTan"

Owner

@schlegelp schlegelp left a comment

Thanks for this PR, there is some good stuff in it. Unfortunately, you are also making some changes that greatly affect how the library works and what it can be used for, and I'm not on board with those (see my comments). I'm happy to discuss options on how to proceed, though.

method=sort, **sort_kwargs)

fig = pylab.figure(figsize=(8, 8))
def draw_tanglegram(linkage_1, linkage_2, labels1, labels2, color_by_diff=True, dend_kwargs={}):
Owner

Any reason why you dropped the entire docstring and a couple of parameters?

Owner

Also, am I correct that you want to change the workflow such that people produce the linkage themselves (i.e. no more DataFrames), untangle it, and then pass it to the plotting function?

Contributor Author

Hi. Sorry for dropping some parameters, which changes how the library is used. I deleted them because I did not use them, but you are right, those options should remain available to users.

Yes. The workflow I had in mind is that users produce the linkages first and then use the untangle methods and the "draw" function to get the desired tanglegram layout.

"""Untangle dendrogram using a simple random search.

Owner

I'm not really on board with you removing the empty lines in the docstrings.

often one will want to use 0, 1, 1.5 or 2:
``sum(abs(x-y)^L)``.

def entanglement(link1, link2):
Owner

Seems like L is still accepted (and other functions use it as parameter) but entanglement now ignores it and just does squared distance?

Contributor Author

This is my mistake. L should be chosen to be 0, 1, 1.5 or 2. Because I always use L = 2, I hard-coded it. I am fixing it.
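For reference, the formula quoted from the docstring, ``sum(abs(x-y)^L)``, can be sketched with ``L`` kept as a real parameter. This is only a minimal illustration, not the library's implementation: the name `entanglement_sketch` and the plain-list inputs (leaf positions of the same objects in each dendrogram) are assumptions, and no normalisation is applied.

```python
def entanglement_sketch(pos1, pos2, L=1.5):
    """Sum of |x - y| ** L over paired leaf positions (hypothetical helper)."""
    if len(pos1) != len(pos2):
        raise ValueError("position lists must have the same length")
    return sum(abs(x - y) ** L for x, y in zip(pos1, pos2))


# Positions of the same five objects in the left and right dendrogram.
left = [1, 2, 3, 4, 5]
right = [4, 2, 1, 5, 3]

print(entanglement_sketch(left, right, L=1))  # 8
print(entanglement_sketch(left, right, L=2))  # 18
```

With ``L=2`` the result equals the squared-distance behaviour discussed above; other values of ``L`` only give different results if the exponent is actually passed through.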


exist_in_both = list(set(lindex1) & set(lindex2))
ix = np.arange(max(len(lindex1), len(lindex2)))

if not exist_in_both:
Owner

If you are not using labels but just indices, then there is no point in checking if they exist in both. Or am I missing something?

Contributor Author

The "leaves_list" function returns the list of leaf indices, so we only work with indices. In doing so, we have to assume that the relationship between indices and labels is one-to-one and identical in both trees.

Owner

That's fair but I want/need to cater for scenarios where that's not the case.

Contributor Author

Totally agree. We should use labels instead of indices.

index=labelsB)
# Mapping the "number" (1 til tree size) in the left tree with the right tree
matching_leaf_vector = np.zeros(max(len(lindex1), len(lindex2)))
for i in lindex2:
Owner

I haven't tested it properly but this for loop can't possibly be faster than the previous array-based solution. Could you elaborate a bit on what the advantage of doing it this way is?

Contributor Author

The previous array-based solution is actually not correct. Here we want to match the "number" (1 to tree size) of each leaf in the left tree with the right tree and then calculate the difference between these numbers in the two trees, not the difference between indices. Is that the matching you do with the "dict"? Honestly, I do not understand that part.

The old calculation gives different results from those obtained in R.

Owner

@schlegelp schlegelp Jun 24, 2021

I disagree re the existing solution being incorrect - it might not yield exactly the same results as in R but certainly does the same job. Using {label: index} dicts is necessary for scenarios where labels in both dendrograms don't match up 1:1.
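To illustrate why {label: index} dicts help when the labels don't match up 1:1, here is a rough sketch that pairs positions only for labels present in both dendrograms. The helper name `shared_positions` and the example leaf orders are hypothetical and are not taken from the library.

```python
def shared_positions(order1, order2):
    """Pair leaf positions for labels present in both orders (hypothetical helper)."""
    pos1 = {label: i for i, label in enumerate(order1)}
    pos2 = {label: i for i, label in enumerate(order2)}
    # Keep the left-hand order and drop labels missing on the right.
    shared = [label for label in order1 if label in pos2]
    return shared, [pos1[lb] for lb in shared], [pos2[lb] for lb in shared]


# "E" and "F" exist only on the left, "G" only on the right.
shared, x, y = shared_positions(["A", "D", "E", "C", "F"], ["C", "D", "A", "G"])
print(shared)  # ['A', 'D', 'C']
print(x)       # [0, 1, 3]
print(y)       # [2, 1, 0]
```

An index-only approach has no way to tell that "E", "F" and "G" have no counterpart, which is the scenario the dict lookup is meant to handle.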

Contributor Author

I agree it is necessary to use such a dictionary. But I am not sure it matches the numbers (from 1 to tree size) in the left tree with the right tree. What I wanted can be illustrated with the following example:
Left tree: A D E C F
Right tree: C D A F E
Giving objects in the left tree numbers from 1 to tree size yields: 1, 2, 3, 4, 5
Matching these numbers with the right tree: 4, 2, 1, 5, 3
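The same example can be written as a short sketch using a {label: number} dict. The variable names here are illustrative only and are not taken from the library code.

```python
left = ["A", "D", "E", "C", "F"]
right = ["C", "D", "A", "F", "E"]

# Number the objects in the left tree from 1 to tree size.
left_number = {label: n for n, label in enumerate(left, start=1)}

# Look up each object of the right tree to get its left-tree number.
matched = [left_number[label] for label in right]
print(matched)  # [4, 2, 1, 5, 3]
```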

Owner

That's pretty much what the existing function does with the dict.

Contributor Author

Then that's my bad for not realizing it.

return best_linkage1, best_linkage2, min_entang, improved


-def untangle(link1, link2, labels1, labels2, method='random', L=1.5, **kwargs):
+def untangle(link1, link2, labels1, labels2, method='random', L=2.0, **kwargs):
Owner

Looks like labels are still accepted but essentially ignored in favour of just using the linkage. This implies that the labels in each linkage always match perfectly (i.e. index left 1 = index 1 right and so on). This may work for toy examples but will not be true for most real world examples. This change is unfortunately a deal breaker.

Contributor Author

It is assumed that the sets of objects (labels) in the two dendrograms have a one-to-one correspondence. Such a case occasionally occurs in real life, for example when we apply different hierarchical clustering algorithms to the same dataset.

@@ -720,7 +620,6 @@ def shuffle_dendogram(link, copy=True):

def leaf_order(link, labels=None, as_dict=True):
Contributor Author

I still find that the entanglement behaves strangely, and I might know the reason: the problem comes from the leaf_order function.

  • leafs_ix = sclust.hierarchy.leaves_list(link) returns the indices of the objects in the order they appear in the dendrogram (these indices correspond to the indices in "labels").
  • if as_dict: if not isinstance(labels, type(None)): return dict(zip(labels, leafs_ix))
    pairs the i-th entry of "labels" with the i-th entry of sclust.hierarchy.leaves_list(link).
  • However, the order of the objects in "labels" differs from their order along the x-axis of the dendrogram, so the matching is wrong.

I can give you an example via email.
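A small sketch of the mismatch described above, assuming leaf_order really builds its dict as quoted. The `leaves` list is a hard-coded stand-in for what leaves_list might return for a three-leaf dendrogram, and `by_position` is one possible alternative mapping, not the library's code.

```python
labels = ["A", "B", "C"]
# Hypothetical leaves_list output: original indices in x-axis order,
# i.e. the dendrogram reads C, A, B from left to right.
leaves = [2, 0, 1]

# Mapping as quoted: the i-th label is paired with the i-th leaf index.
zipped = dict(zip(labels, leaves))
print(zipped)  # {'A': 2, 'B': 0, 'C': 1}

# Alternative: map each label to its position along the x-axis.
by_position = {labels[ix]: pos for pos, ix in enumerate(leaves)}
print(by_position)  # {'C': 0, 'A': 1, 'B': 2}
```

In this toy case the two dicts disagree for every label, which would be consistent with the behaviour reported here.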
