
Bug: EnableCompression seems to have no effect #5050

Open
adsharma opened this issue Mar 13, 2025 · 7 comments

@adsharma

Kuzu version

v0.8.2

What operating system are you using?

MacOS Sequoia

What happened?

I created 2 databases, each with 10k nodes, 100k rels

random: the rels were randomly distributed between nodes
locality: the rels were local within a window of 100 node ids

I was expecting the two databases to have different sizes, given these distributions and my reading of the code (enableCompression=true by default, and you're doing IntegerBitPacking).
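As a rough illustration, here is a minimal sketch (hypothetical, not the exact scripts I used) of how the two distributions can be generated; both produce the same node and edge counts, only the spread of destination IDs differs:

# Hypothetical sketch of the two edge distributions (not the exact generation scripts).
import random

NUM_NODES = 10_000
NUM_RELS = 100_000
WINDOW = 100  # locality window over node ids

def random_edges():
    # destination drawn uniformly from all node ids
    return [(random.randrange(NUM_NODES), random.randrange(NUM_NODES))
            for _ in range(NUM_RELS)]

def locality_edges():
    # destination drawn from within +/- WINDOW/2 of the source id
    edges = []
    for _ in range(NUM_RELS):
        src = random.randrange(NUM_NODES)
        lo = max(0, src - WINDOW // 2)
        hi = min(NUM_NODES - 1, src + WINDOW // 2)
        edges.append((src, random.randint(lo, hi)))
    return edges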

However, the persistent database sizes were identical:

7.6M locality_distrib
7.6M random_distrib

Should I repeat the test with larger db sizes? Are there any other configs that need to be tweaked before I see a difference between the two distributions?

Are there known steps to reproduce?

No response

@adsharma adsharma added the bug Something isn't working label Mar 13, 2025
@adsharma
Author

random:

kuzu> match (a:Node)-[b:Relates]-(c:Node) where a.id=100 return b;
┌────────────────────────────────────────────────────┐
│ b                                                  │
│ REL                                                │
├────────────────────────────────────────────────────┤
│ (0:100)-{_LABEL: Relates, _ID: 1:12838}->(0:467)   │
│ (0:100)-{_LABEL: Relates, _ID: 1:22748}->(0:7029)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:28074}->(0:962)   │
│ (0:100)-{_LABEL: Relates, _ID: 1:25881}->(0:1843)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:32690}->(0:4104)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:30597}->(0:5167)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:56943}->(0:3580)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:80084}->(0:2077)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:88462}->(0:9238)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:106752}->(0:1364) │

locality:

kuzu> match (a:Node)-[b:Relates]-(c:Node) where a.id=100 return b;
┌───────────────────────────────────────────────────┐
│ b                                                 │
│ REL                                               │
├───────────────────────────────────────────────────┤
│ (0:100)-{_LABEL: Relates, _ID: 1:18086}->(0:106)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:11917}->(0:93)   │
│ (0:100)-{_LABEL: Relates, _ID: 1:14547}->(0:129)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:15188}->(0:59)   │
│ (0:100)-{_LABEL: Relates, _ID: 1:44913}->(0:95)   │
│ (0:100)-{_LABEL: Relates, _ID: 1:56435}->(0:125)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:84433}->(0:119)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:85369}->(0:112)  │
│ (0:100)-{_LABEL: Relates, _ID: 1:94158}->(0:91)   │
│ (0:100)-{_LABEL: Relates, _ID: 1:107098}->(0:139) │
│ (0:100)-{_LABEL: Relates, _ID: 1:107642}->(0:52)  │
│ (0:169)-{_LABEL: Relates, _ID: 1:12648}->(0:100)  │
│ (0:91)-{_LABEL: Relates, _ID: 1:27207}->(0:100)   │

@adsharma adsharma changed the title from "Bug:" to "Bug: EnableCompression seems to have no effect" Mar 13, 2025
@ray6080
Contributor

ray6080 commented Mar 13, 2025

Hi @adsharma, did you try with compression disabled to see what the binary sizes are?
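For example (a sketch from the Python API; I'm assuming the compression keyword maps to the C++ SystemConfig::enableCompression flag, so check the exact parameter name):

# Sketch: one database with compression on (the default) and one with it off.
# Assumes the Python binding exposes a `compression` flag mirroring SystemConfig::enableCompression.
import kuzu

db_default = kuzu.Database("random_distrib_compressed")                           # compression on by default
db_uncompressed = kuzu.Database("random_distrib_uncompressed", compression=False)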

@adsharma
Author

@ray6080 no difference in kuzu, but parquet export shows the delta:

du -sh *export
364K	locality_export
500K	random_export
du -sh *compressed;
4.1M	locality_distrib_compressed
4.1M	locality_distrib_uncompressed
4.1M	random_distrib_compressed
4.1M	random_distrib_uncompressed

File sizes:

locality_distrib_compressed:
total 8464
-rw-r--r--  1 arun  staff      429 Mar 13 13:56 catalog.kz
-rw-r--r--  1 arun  staff  1732608 Mar 13 13:56 data.kz
-rw-r--r--  1 arun  staff     1239 Mar 13 13:56 metadata.kz
-rw-r--r--  1 arun  staff  2117632 Mar 13 13:56 n-0.hindex

locality_distrib_uncompressed:
total 8464
-rw-r--r--  1 arun  staff      429 Mar 13 13:55 catalog.kz
-rw-r--r--  1 arun  staff  1732608 Mar 13 13:55 data.kz
-rw-r--r--  1 arun  staff     1239 Mar 13 13:55 metadata.kz
-rw-r--r--  1 arun  staff  2117632 Mar 13 13:55 n-0.hindex

random_distrib_compressed:
total 8464
-rw-r--r--  1 arun  staff      429 Mar 13 13:56 catalog.kz
-rw-r--r--  1 arun  staff  1732608 Mar 13 13:56 data.kz
-rw-r--r--  1 arun  staff     1239 Mar 13 13:56 metadata.kz
-rw-r--r--  1 arun  staff  2117632 Mar 13 13:56 n-0.hindex

random_distrib_uncompressed:
total 8464
-rw-r--r--  1 arun  staff      429 Mar 13 13:54 catalog.kz
-rw-r--r--  1 arun  staff  1732608 Mar 13 13:54 data.kz
-rw-r--r--  1 arun  staff     1239 Mar 13 13:54 metadata.kz
-rw-r--r--  1 arun  staff  2117632 Mar 13 13:54 n-0.hindex

@adsharma
Author

I repeated the experiment with a larger dataset.
TL;DR: when imported into DuckDB, the locality dataset is smaller than the random one, as expected. With Kuzu, it's the other way around.

random_parq.py
https://gist.github.com/adsharma/0f47c379db3ba1dae2078d24ec41f4e5

locality_parq.py
https://gist.github.com/adsharma/4d1af45889c27b598fc94626f7a3e4e3

duckdb queries:

create table rel as select * from read_parquet('random_distrib/relationships_combined.parquet') order by from_id, to_id;
create table rel as select * from read_parquet('locality_distrib/relationships_combined.parquet') order by from_id, to_id;

sizes:

du -sh *.db
 21M	local_duck.db
 28M	random_duck.db

Kuzu:

du -sh *kuzu
213M	local_kuzu
211M	random_kuzu

Parquet:

 du -sh *_distrib
 65M	locality_distrib
 64M	random_distrib

I didn't try to sort the input with kuzu like I did with duckdb.

kuzu> COPY Relates from "random_distrib/relationships_combined.parquet";
kuzu> COPY Relates from "locality_distrib/relationships_combined.parquet";
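If someone wants to try the sorted variant, a sketch (the sorted file name and database path are made up) of pre-sorting the parquet with DuckDB before the Kuzu COPY:

# Sketch: write a sorted copy of the relationship parquet with DuckDB, then COPY it into Kuzu.
# File and database names are illustrative; assumes the Node/Relates schema already exists in the target db.
import duckdb, kuzu

duckdb.sql("""
    COPY (SELECT * FROM read_parquet('random_distrib/relationships_combined.parquet')
          ORDER BY from_id, to_id)
    TO 'random_distrib/relationships_sorted.parquet' (FORMAT parquet)
""")

conn = kuzu.Connection(kuzu.Database("random_kuzu_sorted"))
conn.execute("COPY Relates FROM 'random_distrib/relationships_sorted.parquet'")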

@benjaminwinger
Collaborator

I think the biggest issue is that for some reason we're calculating a max value of 18446744073709551615 (UINT64_MAX) in a bunch of the neighbour ID data chunks (and I also saw 131669000474944 and 131669274923568), which effectively disables the compression. The number of data chunks affected seems to be fairly random, and it has more of an impact than the difference in locality in your examples, which is presumably why you saw the random one being smaller than the local one.
Interestingly, that goes away when I run it in single-threaded mode (so at the moment I think you can mostly work around this issue by using single-threaded mode to create the database).
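Something like this is what I mean by single-threaded creation (a sketch; the schema and file names are illustrative, and I'm assuming the `CALL threads=1` connection setting):

# Sketch of the single-threaded workaround: limit Kuzu to one thread while creating the database.
# Schema and file names are illustrative; assumes the `CALL threads=1` connection setting.
import kuzu

db = kuzu.Database("local_kuzu_single_threaded")
conn = kuzu.Connection(db)
conn.execute("CALL threads=1;")
conn.execute("CREATE NODE TABLE Node(id INT64, PRIMARY KEY(id));")
conn.execute("CREATE REL TABLE Relates(FROM Node TO Node);")
conn.execute("COPY Node FROM 'locality_distrib/nodes.parquet';")
conn.execute("COPY Relates FROM 'locality_distrib/relationships_combined.parquet';")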

Additionally, the IDs we use internally do not correspond to the primary key IDs of the nodes, and relationship data is stored using the internal IDs, not the primary keys, which means that there is very little locality unless you create the database (or at least the node table) in single-threaded mode, which should keep the internal IDs more or less consistent with the IDs you've provided.

Similarly, the relationships are not ordered in your example, so the chunks storing relationship IDs have very little locality (though due to a bug mentioned below you will see almost no improvement from sorting them at the moment).

There also appears to be an issue with how we're calculating the minimum values in a chunk, which is losing us a few bits of compression. According to the storage_info, the minimum value for all the NBR_ID chunks is 0, which shouldn't be right, and it means that after the first chunk they are being compressed to 19 or 20 bits per value instead of 18, since they don't take advantage of the frame-of-reference compression (I'd tried this with just one of the relationship parquet files since the script wasn't creating a combined file; for more values the number of bits per value will keep going up). I suspect that this is unrelated to the high max values mentioned above and is probably just a problem with not skipping null values, since it also occurs when the database is created with a single thread. The frame-of-reference compression is really the only way we take advantage of the locality in your example, since bitpacking by itself won't do much for chunks where the values are all large.
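To make the bit-width arithmetic concrete, a back-of-the-envelope sketch (not Kuzu's actual implementation; the node count and chunk range are illustrative):

# Plain bitpacking needs ceil(log2(max + 1)) bits per value; frame-of-reference stores
# (value - min), so it needs ceil(log2(max - min + 1)) bits.
from math import ceil, log2

def bits_needed(lo, hi):
    return max(1, ceil(log2(hi - lo + 1)))

num_nodes = 200_000   # illustrative node count, ~18 bits to cover every internal id
window = 100          # locality window from the example

print(bits_needed(0, num_nodes - 1))            # 18 bits/value when the chunk minimum is taken as 0
print(bits_needed(150_000, 150_000 + window))   # 7 bits/value using the chunk's true minimum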

Finally, the Relates string property (same for each relationship) is using about 41 pages to store the single string value for each node group (~15% of the total data file size). We could be more aggressive with keeping the initial size of the string data chunk small (particularly since we've started working on reclaiming space, so frequently re-writing a chunk as its size grows will be less of a problem).

@adsharma
Author

adsharma commented Mar 17, 2025

Thank you for looking into this! After applying the two changes above:

  • Drop the Relates string property
  • Use THREADS=1

I do see an improvement:

du -sh local_kuzu*
213M	local_kuzu
177M	local_kuzu2

But I worry that this is still ~8x larger than duckdb at 21M.

Re: kuzu using internal IDs different from primary key

This is a great idea. It allows you to reorder the graph for much better compression without worrying about breaking user specified keys. For example, this paper from 2016 shows you can get it down to 8 bits/edge for a billion node graph.

These numbers are from a graph partitioning algorithm run on a CPU. With cugraph and GPUs, I believe such algorithms can be run on a single GPU box much faster and the resulting database can be copied to cheaper CPU-only boxes and run with much higher performance due to the compression.

Also see related cugraph issue.

Hopefully, after fixing the Kuzu-specific issues you note above, this particular test case can be brought down to 10 million * 8 bits/edge ≈ 10 MB.
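Spelling that estimate out:

# 10 million edges at 8 bits/edge:
edges = 10_000_000
print(edges * 8 / 8 / 1_000_000)  # = 10.0 MB (bits -> bytes -> MB)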

@aracardan

Hi @adsharma - thanks for using Kuzu and contributing to it with the data and links above. Since you seem like a pretty sophisticated user, I wanted to also share our discord with you. You might find it useful for posting questions: https://kuzudb.com/chat

You're also welcome to reach out to us through [email protected] if you want to dig deeper in person.
