Bug: EnableCompression seems to have no effect #5050
Comments
Hi @adsharma, did you try with compression disabled and see what the binary sizes are?
@ray6080 No difference in Kuzu, but the Parquet export shows the delta:
File sizes: …
I repeated the experiment with a larger dataset (random_parq.py, locality_parq.py). DuckDB queries: …
Sizes:
Kuzu: …
Parquet: …
I didn't try to sort the input with Kuzu like I did with DuckDB.
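The DuckDB queries and the exact sizes were not captured above. As a rough sketch of the kind of size comparison being described, assuming Kuzu's COPY ... TO Parquet export and DuckDB's COPY ... TO (FORMAT PARQUET), and with illustrative file and table names, it might look like this:

```python
import os

import duckdb
import kuzu

# Export the relationship list from a Kuzu database to Parquet.
# "locality_distrib", "Node", and "Edge" are illustrative names.
conn = kuzu.Connection(kuzu.Database("locality_distrib"))
conn.execute(
    "COPY (MATCH (a:Node)-[:Edge]->(b:Node) RETURN a.id, b.id) "
    "TO 'kuzu_rels.parquet'"
)

# Load the same (headerless src,dst) edge list into DuckDB, sort it,
# and export it to Parquet as well.
duck = duckdb.connect()
duck.execute(
    "CREATE TABLE rels AS SELECT * FROM read_csv_auto('locality_rels.csv')"
)
duck.execute(
    "COPY (SELECT * FROM rels ORDER BY 1, 2) "
    "TO 'duck_rels.parquet' (FORMAT PARQUET)"
)

# Compare on-disk sizes of the two Parquet files.
for path in ("kuzu_rels.parquet", "duck_rels.parquet"):
    print(path, round(os.path.getsize(path) / 1e6, 2), "MB")
```

Parquet's per-column encodings react to value locality and sort order, which is presumably why the exports show a delta even when the database files do not.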
I think the biggest issue is that for some reason we're calculating a max value of …

Additionally, the IDs we use internally do not correspond to the primary key IDs of the nodes, and relationship data is stored using the internal IDs, not the primary keys, which means that there is very little locality unless you create the database (or at least the node table) in single-threaded mode, which should keep the internal IDs more or less consistent with the IDs you've provided. Similarly, the relationships are not ordered in your example, so the chunks storing relationship IDs have very little locality (though due to a bug mentioned below you will see almost no improvement from sorting them at the moment).

There also appears to be an issue with how we're calculating the minimum values in a chunk, which is losing us a few bits of compression. According to the …

Finally, the …
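To make the locality point above concrete, here is a back-of-the-envelope sketch, not Kuzu's actual storage code: the chunk size and the simplified frame-of-reference bit-packing model are assumptions. Each chunk is stored as offsets from its own minimum, so the bit width is driven by max - min within the chunk, which stays near the full node-ID range for random neighbor IDs but stays small for windowed ones:

```python
import math
import random

CHUNK_SIZE = 2048  # illustrative chunk size, not Kuzu's actual constant


def bits_per_value(values, chunk_size=CHUNK_SIZE):
    """Average bits/value if each chunk is bit-packed relative to its own minimum."""
    total_bits = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        spread = max(chunk) - min(chunk)
        width = math.ceil(math.log2(spread + 1)) if spread > 0 else 1
        total_bits += width * len(chunk)
    return total_bits / len(values)


random.seed(0)
n_nodes, n_rels, window = 10_000, 100_000, 100

# Neighbor IDs for edges stored grouped by (sorted) source node.
srcs = sorted(random.randrange(n_nodes) for _ in range(n_rels))
random_dsts = [random.randrange(n_nodes) for _ in srcs]
local_dsts = [
    min(n_nodes - 1, max(0, s + random.randrange(-window, window + 1))) for s in srcs
]

print("random dsts  :", round(bits_per_value(random_dsts), 1), "bits/value")  # roughly 14
print("windowed dsts:", round(bits_per_value(local_dsts), 1), "bits/value")   # roughly 9
```

This is also why the internal-ID ordering matters: even a locality-friendly dataset loses its small per-chunk spread if the IDs actually written to the chunks are shuffled.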
Thank you for looking into this! After applying the two changes above:
I do see an improvement: …
But I worry that this is still ~8x larger than DuckDB at 21M.

Re: Kuzu using internal IDs different from the primary key. This is a great idea. It allows you to reorder the graph for much better compression without worrying about breaking user-specified keys. For example, this paper from 2016 shows you can get it down to 8 bits/edge for a billion-node graph. These numbers are from a graph partitioning algorithm run on a CPU. With …

Also see related …

Hopefully, after fixing the Kuzu-specific issues you note above, this particular test case can be brought down to 10 million * 8 bits/edge ≈ 10 MB.
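For intuition on the bits/edge numbers in this discussion, here is a generic gap-encoding estimate, not Kuzu's format or the paper's exact scheme: with sorted adjacency lists, neighbors can be stored as gaps from the source ID and from each other, so an ordering that keeps neighbor IDs near the source keeps the gaps, and hence the bit widths, small:

```python
import math
import random
from collections import defaultdict


def bits_per_edge(edges):
    """Estimate bits/edge if each sorted adjacency list is gap-encoded
    (first gap taken relative to the source ID, sign stored as one extra bit)."""
    adj = defaultdict(list)
    for s, d in edges:
        adj[s].append(d)
    total_bits = 0
    n_edges = 0
    for s, neighbors in adj.items():
        neighbors.sort()
        prev = s
        for d in neighbors:
            gap = abs(d - prev)
            total_bits += max(1, math.ceil(math.log2(gap + 1))) + 1  # +1 sign bit
            prev = d
            n_edges += 1
    return total_bits / n_edges


random.seed(0)
n_nodes, n_rels, window = 10_000, 100_000, 100
random_edges = [
    (random.randrange(n_nodes), random.randrange(n_nodes)) for _ in range(n_rels)
]
local_edges = [
    (s, min(n_nodes - 1, max(0, s + random.randrange(-window, window + 1))))
    for s in (random.randrange(n_nodes) for _ in range(n_rels))
]

print("random  :", round(bits_per_edge(random_edges), 1), "bits/edge")
print("locality:", round(bits_per_edge(local_edges), 1), "bits/edge")
```

Partition-based reordering of a much larger graph is roughly the same idea pushed further, which is where figures like ~8 bits/edge come from.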
Hi @adsharma - thanks for using Kuzu and contributing to it with the data and links above. Since you seem like a pretty sophisticated user, I wanted to also share our Discord with you. You might find it useful for posting questions (https://kuzudb.com/chat). You're also welcome to reach out to us through [email protected] if you want to dig deeper in person.
Kuzu version
v0.8.2
What operating system are you using?
macOS Sequoia
What happened?
I created two databases, each with 10k nodes and 100k rels (see the sketch below):
random: the rels were randomly distributed between nodes
locality: the rels were local within a window of 100 node ids
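The generation scripts themselves are not included above. As a rough illustration only, assuming the kuzu Python API and illustrative table, column, and file names rather than the reporter's actual scripts, the two distributions might be generated and loaded like this:

```python
import csv
import random

import kuzu  # assumes the kuzu Python package is installed

NUM_NODES = 10_000
NUM_RELS = 100_000
WINDOW = 100  # "locality" case: dst stays within 100 node ids of src


def write_csvs(prefix: str, local: bool) -> None:
    """Write node and (src, dst) rel CSVs; edges are either random or windowed."""
    with open(f"{prefix}_nodes.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(NUM_NODES):
            writer.writerow([i])
    with open(f"{prefix}_rels.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(NUM_RELS):
            src = random.randrange(NUM_NODES)
            if local:
                dst = (src + random.randrange(-WINDOW, WINDOW + 1)) % NUM_NODES
            else:
                dst = random.randrange(NUM_NODES)
            writer.writerow([src, dst])


def build_db(db_path: str, prefix: str) -> None:
    """Create node/rel tables and bulk-load them from the CSVs."""
    conn = kuzu.Connection(kuzu.Database(db_path))
    conn.execute("CREATE NODE TABLE Node(id INT64, PRIMARY KEY(id))")
    conn.execute("CREATE REL TABLE Edge(FROM Node TO Node)")
    conn.execute(f"COPY Node FROM '{prefix}_nodes.csv'")
    conn.execute(f"COPY Edge FROM '{prefix}_rels.csv'")


for prefix, local in [("random", False), ("locality", True)]:
    write_csvs(prefix, local)
    build_db(f"{prefix}_distrib", prefix)
```

Per the analysis in the comments, loading the node table with a single thread should keep Kuzu's internal node IDs roughly aligned with the provided primary keys, which affects how much of this locality survives into the stored chunks.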
I was expecting different db sizes due to these distributions and my reading of the code (enableCompression=true by default, and you're doing IntegerBitPacking). However, the persistent database sizes were identical:
7.6M locality_distrib
7.6M random_distrib
Should I repeat the test with larger db sizes? Are there any other configs that need to be tweaked before I see a size difference between the two distributions?
Are there known steps to reproduce?
No response