Improve tutorial documentation #353

Open
prrao87 opened this issue Feb 3, 2025 · 5 comments

@prrao87
Member

prrao87 commented Feb 3, 2025

Users have been asking for more tutorials and examples in languages other than Python.

I propose that we update the Tutorials section in our docs to demonstrate that Kùzu can be used from a variety of client languages. We need to showcase the same workflow, on the same dataset, highlighting that Kùzu caters to users coming from almost any language.

Subtask 1

First, we need to create an artificial dataset that clearly demonstrates the benefits of using a graph to answer the following kinds of queries.

  • 2-hop queries in graphs
  • Aggregation (Cypher doesn't have a GROUP BY clause, so we need to show how you can aggregate on a particular property while grouping on another)
  • Shortest paths using our convenient SHORTEST keyword in Cypher
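
To make the aggregation point concrete: in Cypher, the non-aggregated expressions in RETURN act as the grouping keys, so something like `RETURN u.username, avg(p.like_count)` groups by username without a GROUP BY clause. The plain-Rust sketch below mimics that grouping over a few made-up rows. It does not use Kùzu at all, and the data and the `avg_likes_by_user` name are illustrative assumptions.

```rust
use std::collections::BTreeMap;

/// Average like count per username, mimicking
/// `RETURN u.username, avg(p.like_count)`, where username is the
/// implicit grouping key. (Hypothetical helper, not part of any client API.)
fn avg_likes_by_user(rows: &[(&str, i64)]) -> BTreeMap<String, f64> {
    // Accumulate (sum, count) per grouping key.
    let mut acc: BTreeMap<String, (i64, i64)> = BTreeMap::new();
    for (username, likes) in rows {
        let entry = acc.entry(username.to_string()).or_insert((0, 0));
        entry.0 += likes;
        entry.1 += 1;
    }
    // Turn each (sum, count) pair into an average.
    acc.into_iter()
        .map(|(user, (sum, n))| (user, sum as f64 / n as f64))
        .collect()
}

fn main() {
    // Made-up (username, like_count) rows, one per post.
    let rows = [("alice", 10), ("bob", 4), ("alice", 20), ("bob", 6), ("carol", 7)];
    for (user, avg) in avg_likes_by_user(&rows) {
        println!("{}: {}", user, avg);
    }
}
```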

Subtask 2

Write a tutorial for each client language that we officially support, showcasing the same end-to-end workflow. Each tutorial would read in data from CSV/Parquet files. We would create individual sub-issues linked to this issue that various team members can take on.

cc @aracardan @WWW0030

@WWW0030 WWW0030 self-assigned this Feb 3, 2025
@WWW0030
Contributor

WWW0030 commented Feb 3, 2025

I think that we can use an artificial dataset that represents a community of Twitter users to demonstrate these benefits.

Dataset: Twitter community

Nodes:

User

  • userId (INT64 PRIMARY KEY)
  • username (STRING)
  • account_creation_date (DATE)

Posts

  • postId (INT64 PRIMARY KEY)
  • post_date (DATE)
  • like_amount (INT64)
  • retweet_amount (INT64)

Relations:

Follows (FROM User TO User)
Posts (FROM User TO Posts)
Likes (FROM User TO Posts)

Queries

  • 2-hop queries can recommend users to follow, via the users followed by the people they already follow
  • Aggregation can return statistics about the users in the group, for example, the average number of followers/followees each user has
  • Shortest path can return the shortest path from user A to user B.
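
As a rough illustration of what the 2-hop recommendation should return, here is a plain-Rust sketch over an in-memory adjacency list. It is independent of Kùzu, the edges are made up, and `recommend` is a hypothetical helper. Note that it also drops the user themselves from the result, which a bare 2-hop pattern would not necessarily do.

```rust
use std::collections::{BTreeSet, HashMap};

/// Users reachable in exactly two FOLLOWS hops from `user`,
/// excluding `user` and anyone they already follow.
fn recommend<'a>(follows: &HashMap<&'a str, Vec<&'a str>>, user: &str) -> BTreeSet<String> {
    // First hop: the accounts `user` follows directly.
    let direct: &[&str] = follows.get(user).map(|v| v.as_slice()).unwrap_or(&[]);
    let mut out = BTreeSet::new();
    for u2 in direct {
        // Second hop: the accounts followed by each followee.
        for u3 in follows.get(*u2).map(|v| v.as_slice()).unwrap_or(&[]) {
            if *u3 != user && !direct.contains(u3) {
                out.insert(u3.to_string());
            }
        }
    }
    out
}

fn main() {
    // Made-up FOLLOWS edges: follower -> followees.
    let follows = HashMap::from([
        ("ann", vec!["bob", "cat"]),
        ("bob", vec!["cat", "dan", "ann"]),
        ("cat", vec!["eve"]),
    ]);
    // "cat" is already followed and "ann" is the user, so both are filtered out.
    println!("{:?}", recommend(&follows, "ann"));
}
```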

Other suggested queries:

  • We should start the tutorial off with a couple of queries that can be expressed in both SQL and a graph query language. I think this lets some users start in more familiar territory and also introduces basic Cypher syntax to someone with SQL knowledge.
  • We can also add slightly more complicated queries. One query that I think can show the power of graph queries is to use the graph to generate a personal recommendation page for a specific User (by matching their followers' posts and liked posts, and ordering them by like count).
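
One possible reading of the recommendation page, sketched in plain Rust with no Kùzu involved: collect the posts that the accounts a user follows have posted or liked, then order them by like count. The direction (followees rather than followers), the de-duplication, the `Post` struct, and the `feed` helper are all assumptions made for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Debug, Clone, PartialEq)]
struct Post {
    id: i64,
    like_count: i64,
}

/// Posts that the accounts `user` follows have posted or liked,
/// most-liked first, capped at `limit`.
fn feed(
    follows: &HashMap<&str, Vec<&str>>,
    touched: &HashMap<&str, Vec<i64>>, // user -> ids of posts they posted or liked
    posts: &[Post],
    user: &str,
    limit: usize,
) -> Vec<Post> {
    // Collect ids into a set so a post reached via two followees appears once.
    let mut ids = BTreeSet::new();
    for followee in follows.get(user).into_iter().flatten() {
        for id in touched.get(*followee).into_iter().flatten() {
            ids.insert(*id);
        }
    }
    let mut out: Vec<Post> = posts.iter().filter(|p| ids.contains(&p.id)).cloned().collect();
    // Highest like count first; ties broken by id for a stable order.
    out.sort_by(|a, b| b.like_count.cmp(&a.like_count).then(a.id.cmp(&b.id)));
    out.truncate(limit);
    out
}

fn main() {
    // Made-up data: ann follows bob and cat; dan is unrelated.
    let follows = HashMap::from([("ann", vec!["bob", "cat"])]);
    let touched = HashMap::from([("bob", vec![1, 2]), ("cat", vec![2, 3]), ("dan", vec![4])]);
    let posts = vec![
        Post { id: 1, like_count: 5 },
        Post { id: 2, like_count: 9 },
        Post { id: 3, like_count: 7 },
        Post { id: 4, like_count: 100 },
    ];
    for p in feed(&follows, &touched, &posts, "ann", 2) {
        println!("{:?}", p);
    }
}
```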

I also attached the CSVs to be used in the tutorial. cc @prrao87, please take a look 👍
tutorial_likes.csv.csv
tutorial_posts.csv.csv
tutorial_tweets.csv.csv
tutorial_users.csv.csv
tutorial_follows.csv.csv

@prrao87 prrao87 self-assigned this Feb 3, 2025
@WWW0030
Contributor

WWW0030 commented Feb 4, 2025

Here is a rough draft of what the Rust tutorial queries will look like. I think we should:

  1. Break down a query step by step, explaining what each step does
  2. Link the syntax used in each query to its respective documentation page so that the user can easily refer to it for more context

Any other suggestions are welcome!

use kuzu::{Connection, Database, Error, SystemConfig};

fn main() -> Result<(), Error> {
    // Create an empty on-disk database and connect to it
    let db = Database::new("./demo_db", SystemConfig::default())?;
    let conn = Connection::new(&db)?;

    // Create the tables
    conn.query("CREATE NODE TABLE User(userId INT64 PRIMARY KEY, username STRING, account_creation_date DATE)")?;
    conn.query("CREATE NODE TABLE User_Post(postId INT64 PRIMARY KEY, post_date DATE, like_count INT64, retweet_count INT64)")?;
    conn.query("CREATE REL TABLE FOLLOWS(FROM User TO User)")?;
    conn.query("CREATE REL TABLE POSTS(FROM User TO User_Post)")?;
    conn.query("CREATE REL TABLE LIKES(FROM User TO User_Post)")?;

    // Load the CSV data into the tables
    conn.query("COPY User FROM './data/tutorial_user.csv'")?;
    conn.query("COPY User_Post FROM './data/tutorial_user_post.csv'")?;
    conn.query("COPY FOLLOWS FROM './data/TUTORIAL_FOLLOWS.csv'")?;
    conn.query("COPY POSTS FROM './data/TUTORIAL_POSTS.csv'")?;
    conn.query("COPY LIKES FROM './data/TUTORIAL_LIKES.csv'")?;

    // Two-hop query: recommend users to follow.
    // First, find the users followed by the users that u1 follows:
    conn.query("
        MATCH (u1:User)-[f1:FOLLOWS]->(u2:User)-[f2:FOLLOWS]->(u3:User)
        RETURN u3
    ")?;

    // Next, restrict u1 to the user we want recommendations for, using a WHERE clause:
    conn.query("
        MATCH (u1:User)-[f1:FOLLOWS]->(u2:User)-[f2:FOLLOWS]->(u3:User)
        WHERE u1.username = 'epicking81'
        RETURN u3
    ")?;

    // This is still not entirely correct, since u3 can include users that u1 already follows.
    // As a last step, we extend the WHERE clause to exclude them:
    conn.query("
        MATCH (u1:User)-[f1:FOLLOWS]->(u2:User)-[f2:FOLLOWS]->(u3:User)
        WHERE u1.username = 'epicking81'
        AND NOT (u1)-[:FOLLOWS]->(u3)
        RETURN u3
    ")?;

    // Aggregation: count the number of people a user follows.
    // As above, we first specify the relationship and the user we are interested in:
    conn.query("
        MATCH (u1:User)-[f:FOLLOWS]->(u2:User)
        WHERE u1.username = 'epicking81'
        RETURN u2
    ")?;

    // The previous query returns the list of users our user follows.
    // We can use an aggregate function to return the count instead:
    conn.query("
        MATCH (u1:User)-[f:FOLLOWS]->(u2:User)
        WHERE u1.username = 'epicking81'
        RETURN count(u2)
    ")?;

    // Aggregation is useful in many scenarios. Here are some more examples:

    // 1. Average like count across a user's posts:
    conn.query("
        MATCH (u1:User)-[p:POSTS]->(p2:User_Post)
        WHERE u1.username = 'epicking81'
        RETURN avg(p2.like_count)
    ")?;

    // 2. Maximum like count across a user's posts:
    conn.query("
        MATCH (u1:User)-[p:POSTS]->(p2:User_Post)
        WHERE u1.username = 'epicking81'
        RETURN max(p2.like_count)
    ")?;

    // Shortest path:
    // We can use recursive matching to find paths between nodes. This example
    // returns the length of the shortest path (1 to 4 hops) between two users:
    conn.query("
        MATCH (u1:User)-[f:FOLLOWS* SHORTEST 1..4]->(u2:User)
        WHERE u1.username = 'silentguy245' AND u2.username = 'epicwolf202'
        RETURN length(f) AS length;
    ")?;

    // Recommendation page for a user:
    // The untyped relationship -[]-> matches both POSTS and LIKES.
    conn.query("
        MATCH (u1:User)-[f:FOLLOWS]->(u2:User)-[]->(p:User_Post)
        WHERE p.post_date > '2022-01-01' AND u1.username = 'fastgirl798'
        RETURN p.*
        ORDER BY p.like_count DESC LIMIT 10;
    ")?;

    Ok(())
}
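
For readers new to the `SHORTEST 1..4` pattern in the draft above, the following plain-Rust sketch shows the underlying idea: a breadth-first search that returns the length of the shortest directed path within a hop bound. It runs on a made-up in-memory graph and does not touch Kùzu; `shortest_len` is a hypothetical helper, and Kùzu's own implementation may differ.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Length of the shortest directed path from `src` to `dst` that uses
/// between 1 and `max_hops` edges, or `None` if no such path exists.
fn shortest_len(
    follows: &HashMap<&str, Vec<&str>>,
    src: &str,
    dst: &str,
    max_hops: usize,
) -> Option<usize> {
    let mut seen: HashSet<&str> = HashSet::from([src]);
    let mut queue: VecDeque<(&str, usize)> = VecDeque::from([(src, 0)]);
    // Breadth-first search visits nodes in order of increasing distance,
    // so the first time we reach `dst` the path is a shortest one.
    while let Some((node, dist)) = queue.pop_front() {
        if dist == max_hops {
            continue; // do not expand beyond the hop bound
        }
        for next in follows.get(node).into_iter().flatten() {
            if *next == dst {
                return Some(dist + 1);
            }
            if seen.insert(*next) {
                queue.push_back((*next, dist + 1));
            }
        }
    }
    None
}

fn main() {
    // Made-up FOLLOWS edges: follower -> followees.
    let follows = HashMap::from([
        ("a", vec!["b", "c"]),
        ("b", vec!["c"]),
        ("c", vec!["d"]),
    ]);
    println!("{:?}", shortest_len(&follows, "a", "d", 4));
    println!("{:?}", shortest_len(&follows, "a", "d", 1));
}
```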

@prrao87
Member Author

prrao87 commented Feb 5, 2025

This is a good starting point! Some thoughts:

  • Rename User_Post to Post. It's okay to call this node Post and have a POSTS relationship because we follow a naming convention for nodes/rels and it's easy to distinguish between them as we read the queries
  • The queries could be reformulated based on questions we're trying to ask about the data, for example:
    • Q1: Which user has the most followers, and how many followers do they have?
    • Q2: Which user follows the most people, and how many users do they follow?
    • Q3: What is the shortest path between user A and user B?
    • Q4: How many 3-hop paths exist between user C and user D that pass through user A?

Along those lines. I'm not fully sure I follow the recommendation logic, but maybe flesh out those queries more.
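
To pin down what answers to Q1 and Q4 would look like, here is a plain-Rust sketch on a tiny made-up graph. It is not Kùzu code: the `Graph` alias and the helper names are hypothetical, and Q4 is interpreted here as counting 3-hop walks (nodes may repeat) whose first or second intermediate node is user A.

```rust
use std::collections::HashMap;

type Graph<'a> = HashMap<&'a str, Vec<&'a str>>;

/// Accounts that `n` follows (empty if `n` follows nobody).
fn out<'g, 'a>(g: &'g Graph<'a>, n: &str) -> &'g [&'a str] {
    g.get(n).map(|v| v.as_slice()).unwrap_or(&[])
}

/// Q1: the user with the most followers (highest in-degree) and that count.
fn most_followed<'a>(g: &Graph<'a>) -> Option<(&'a str, usize)> {
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for followee in g.values().flatten() {
        *counts.entry(*followee).or_insert(0) += 1;
    }
    // Ties are broken by name so the result is deterministic.
    counts.into_iter().max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(a.0)))
}

/// Q4: number of 3-hop walks src -> x -> y -> dst where x or y is `via`.
fn three_hop_walks_via(g: &Graph, src: &str, dst: &str, via: &str) -> usize {
    let mut count = 0;
    for x in out(g, src) {
        for y in out(g, x) {
            let passes_via = *x == via || *y == via;
            if passes_via && out(g, y).iter().any(|z| *z == dst) {
                count += 1;
            }
        }
    }
    count
}

fn main() {
    // Made-up FOLLOWS edges: follower -> followees.
    let g: Graph = HashMap::from([
        ("c", vec!["a", "x1"]),
        ("a", vec!["y1", "d"]),
        ("x1", vec!["a"]),
        ("y1", vec!["d", "a"]),
    ]);
    println!("{:?}", most_followed(&g));
    println!("{}", three_hop_walks_via(&g, "c", "d", "a"));
}
```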

Also, we need to think about how the output results are formatted and displayed so that we can explain them. Maybe it makes sense to have the COPY logic in one file and the queries in another?

@sdht0
Contributor

sdht0 commented Feb 5, 2025

"we need to think about how the output results are formatted and displayed so that we can explain them."

Relatedly, maybe we can show how to use the output in further processing, such as feeding it into other queries or exporting it to other formats?

Also, we can show how to perform parameterized queries.
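
As a small motivation for parameterized queries, the std-only Rust sketch below contrasts splicing a value into the query text with keeping the text fixed and passing the value separately under a `$username` placeholder. Both helper functions are hypothetical and nothing here calls Kùzu. The Rust client has its own prepare/execute calls for this, whose exact signatures should be checked against the crate documentation.

```rust
/// Naive approach: splice the value straight into the query text.
fn interpolated(username: &str) -> String {
    format!(
        "MATCH (u:User) WHERE u.username = '{}' RETURN u.userId",
        username
    )
}

/// Parameterized approach: fixed query text plus a separate list of
/// (name, value) bindings for the client to substitute.
fn parameterized(username: &str) -> (&'static str, Vec<(&'static str, String)>) {
    (
        "MATCH (u:User) WHERE u.username = $username RETURN u.userId",
        vec![("username", username.to_string())],
    )
}

fn main() {
    let name = "o'brien";
    // The embedded quote ends the string literal early, so this is not
    // the query we meant to run.
    println!("{}", interpolated(name));
    // The query text stays the same for every user; only the binding changes.
    let (text, params) = parameterized(name);
    println!("{} with {:?}", text, params);
}
```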

@prrao87 prrao87 pinned this issue Feb 5, 2025
@prrao87
Member Author

prrao87 commented Feb 5, 2025

OMG, yes, @sdht0 thanks for that callout - we totally should show parameterized queries ("prepared statements"). Please find a way to work that in @WWW0030 .

Place a new markdown section under the "Tutorials" section in the docs. When you make the PR, target the dev branch so that I can better organize the page after the 0.8.0 release.

@prrao87 prrao87 linked a pull request Feb 5, 2025 that will close this issue

3 participants