ADJUSTMENT #10
base: master
Updated:
- Redefine the "entanglement" function
- Rewrite the "refine" function
- Change the name of the new untangle method from "permutations" to "ShUnTan"
Thanks for this PR - there is some good stuff in there. Unfortunately, you are also making some changes that have a big impact on how the library works and what it can be used for, and I'm not on board with those (see my comments). I'm happy to discuss options on how to proceed though.
method=sort, **sort_kwargs)

fig = pylab.figure(figsize=(8, 8))
def draw_tanglegram(linkage_1, linkage_2, labels1, labels2, color_by_diff=True, dend_kwargs={}):
Any reason why you dropped the entire docstring and a couple parameters?
Also am I correct in that you want to change the workflow such that people produce the linkage themselves (i.e. no more DataFrames), untangle it and then pass it to the plotting function?
Hi. Sorry for dropping some parameters that change the use of the library. I deleted them because I did not use them, but you are right, there should be other options for users.
Yes. The workflow I wanted is that the users produce the linkages first and then use the untangle methods and "draw" function to get the desired tanglegram layout.
"""Untangle dendrogram using a simple random search.
I'm not really on board with you removing the empty lines in the docstrings.
often one will want to use 0, 1, 1.5 or 2:
``sum(abs(x-y)^L)``.

def entanglement(link1, link2):
Seems like `L` is still accepted (and other functions use it as a parameter) but `entanglement` now ignores it and just does squared distance?
This is my mistake. L should be chosen from 0, 1, 1.5 or 2. Because I always use L = 2, I left it fixed at that value. I am fixing it.
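To make the role of `L` concrete, here is a minimal sketch of an entanglement measure that honors the exponent, following the docstring formula ``sum(abs(x-y)^L)``. The function name `entanglement_sketch` and the label-to-position dicts it takes are assumptions for illustration, not the library's API, and the result is left unnormalized.

```python
def entanglement_sketch(pos_left, pos_right, L=1.5):
    """Sum |x - y|**L over labels present in both dendrograms.

    pos_left / pos_right are hypothetical {label: leaf position} dicts.
    Labels that occur in only one dendrogram are skipped.
    """
    shared = set(pos_left) & set(pos_right)
    return sum(abs(pos_left[label] - pos_right[label]) ** L for label in shared)


# Hypothetical leaf positions; "D" only exists in the right dendrogram.
left = {"A": 0, "B": 1, "C": 2}
right = {"A": 2, "B": 1, "C": 0, "D": 3}

print(entanglement_sketch(left, right, L=1))  # absolute differences
print(entanglement_sketch(left, right, L=2))  # squared differences
```

With `L=2` this reduces to the squared distance that the comment above says is currently hard-coded; other values of `L` change how strongly large displacements are penalized.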
exist_in_both = list(set(lindex1) & set(lindex2))
ix = np.arange(max(len(lindex1), len(lindex2)))

if not exist_in_both:
If you are not using labels but just indices, then there is no point in checking if they exist in both. Or am I missing something?
The "leaves_list" function returns the list of leaves' indices. So we only work with indices. In doing so we have to assume that the relationship between indices and labels is one-to-one and identical in both trees.
That's fair but I want/need to cater for scenarios where that's not the case.
Totally agree. We should use labels instead of indices.
index=labelsB)
# Mapping the "number" (1 til tree size) in the left tree with the right tree
matching_leaf_vector = np.zeros(max(len(lindex1), len(lindex2)))
for i in lindex2:
I haven't tested it properly but this for loop can't possibly be faster than the previous array-based solution. Could you elaborate a bit on what the advantage of doing it this way is?
The previous array-based solution is actually not correct. Here we want to match the "number" (1 to tree size) of each object in the left tree with the right tree and then calculate the difference between these numbers in the two trees, not the difference between indices. Is that the matching you do with the "dict"? Honestly, I do not understand that part.
The old calculation leads to different results compared with R.
I disagree re the existing solution being incorrect - it might not yield exactly the same results as in R but certainly does the same job. Using `{label: index}` dicts is necessary for scenarios where labels in both dendrograms don't match up 1:1.
I agree it is necessary to use such a dictionary. But I am not sure it matches the numbers (from 1 to tree size) in the left tree with the right tree. What I wanted can be illustrated by the following example:
Left tree: A D E C F
Right tree: C D A F E
Giving objects in the left tree numbers from 1 to tree size yields: 1, 2, 3, 4, 5
Matching these numbers with the right tree: 4, 2, 1, 5, 3
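The matching in this example can be reproduced with a label-to-number dict. This is a minimal sketch in plain Python; `match_numbers` is a hypothetical helper written for illustration, not a function from the library.

```python
def match_numbers(left_order, right_order):
    """Number the left tree's leaves 1..n, then read those numbers
    off in the order the same labels appear in the right tree."""
    number_in_left = {label: i for i, label in enumerate(left_order, start=1)}
    # Labels missing from the left tree are skipped rather than raising.
    return [number_in_left[label] for label in right_order if label in number_in_left]


left_tree = ["A", "D", "E", "C", "F"]
right_tree = ["C", "D", "A", "F", "E"]

print(match_numbers(left_tree, right_tree))  # [4, 2, 1, 5, 3]
```

Because the lookup goes through labels rather than raw indices, the same helper still works when one tree contains labels the other lacks.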
That's pretty much what the existing function does with the dict.
Then that's my bad for not realizing it.
return best_linkage1, best_linkage2, min_entang, improved


def untangle(link1, link2, labels1, labels2, method='random', L=1.5, **kwargs):
def untangle(link1, link2, labels1, labels2, method='random', L=2.0, **kwargs):
Looks like `labels` are still accepted but essentially ignored in favour of just using the linkage. This implies that the labels in each linkage always match perfectly (i.e. index left 1 = index 1 right and so on). This may work for toy examples but will not be true for most real world examples. This change is unfortunately a deal breaker.
It is assumed that the sets of objects (labels) in the two dendrograms have a one-to-one correspondence. Such a case occasionally occurs in real life when we apply different hierarchical clustering algorithms to the same dataset.
@@ -720,7 +620,6 @@ def shuffle_dendogram(link, copy=True):

def leaf_order(link, labels=None, as_dict=True):
I still find that the entanglement behaves strangely, and I might know the reason: the problem comes from the `leaf_order` function.
- `leafs_ix = sclust.hierarchy.leaves_list(link)` returns a list of the indices of objects as they appear in the dendrogram (these indices correspond to the indices in "labels").
- `if as_dict: if not isinstance(labels, type(None)): return dict(zip(labels, leafs_ix))` matches the labels in "labels" with the indices in `sclust.hierarchy.leaves_list(link)`.
- However, the order of the objects in "labels" differs from their order along the x-axis of the dendrogram, so the matching is wrong.
I can give you an example via email.
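A small sketch of the concern, in plain Python. It assumes that `labels` is ordered like the original observations, that the dict values are meant to be leaf positions along the dendrogram axis, and it uses a hard-coded, hypothetical `leaves_list` result. Both helper names are illustrative and not part of the library.

```python
def zip_mapping(labels, leafs_ix):
    """Pairs labels[i] with leafs_ix[i], as in the quoted snippet."""
    return dict(zip(labels, leafs_ix))


def position_mapping(labels, leafs_ix):
    """Maps each label to the position where it appears in the dendrogram."""
    return {labels[ix]: pos for pos, ix in enumerate(leafs_ix)}


labels = ["A", "B", "C"]   # order of the original observations
leafs_ix = [2, 0, 1]       # hypothetical output: dendrogram shows C, A, B

print(zip_mapping(labels, leafs_ix))       # {'A': 2, 'B': 0, 'C': 1}
print(position_mapping(labels, leafs_ix))  # {'C': 0, 'A': 1, 'B': 2}
```

Under these assumptions the two dicts disagree whenever the leaf order is not its own inverse permutation, which would be consistent with the odd entanglement values described above.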
Updated: