| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "#### Big Data – Exercises - Solutions\n", |
| 8 | + "\n", |
| 9 | + "# Fall 2024 - Week 11 - RumbleDB" |
| 10 | + ] |
| 11 | + }, |
| 12 | + { |
| 13 | + "cell_type": "markdown", |
| 14 | + "metadata": {}, |
| 15 | + "source": [ |
| 16 | + "# Moodle quiz (11.2): querying a bigger git-archive dataset\n", |
| 17 | + "\n", |
| 18 | + "You will have to submit the results of this exercise to Moodle to obtain the weekly bonus. You will need these things:\n", |
| 19 | + "- Something related to the query output (we will grade this)\n", |
| 20 | + "- The query you wrote (ungraded)" |
| 21 | + ] |
| 22 | + }, |
| 23 | + { |
| 24 | + "cell_type": "markdown", |
| 25 | + "metadata": {}, |
| 26 | + "source": [ |
| 27 | + "This exercise was designed for, and tested on, the exam magic box. It should work on all systems, but if you run into issues you can follow the tutorial on how to run Docker on [GitHub Codespaces](https://github.com/RumbleDB/bigdata-exercises/blob/master/Big_Data/exercise05/GitHub_Codespaces.pdf), or the alternative instructions in [last year's exercises](https://github.com/RumbleDB/bigdata-exercises/tree/08ba6ba6222d96003ad7bd895a71ab6c32bcc872/Big_Data/exercise11).\n", |
| 28 | + "\n", |
| 29 | + "To get started, run the cell below to connect Jupyter with RumbleDB (don't worry about the contents of the cell; we don't expect you to know what it does)." |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "code", |
| 34 | + "execution_count": null, |
| 35 | + "metadata": {}, |
| 36 | + "outputs": [], |
| 37 | + "source": [ |
| 38 | + "!pip install rumbledb\n", |
| 39 | + "%load_ext rumbledb\n", |
| 40 | + "%env RUMBLEDB_SERVER=http://rumble:9090/jsoniq" |
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": {}, |
| 46 | + "source": [ |
| 47 | + "### Check the data\n", |
| 48 | + "We provide you with a bigger git-archive dataset, [git-archive-big.json](https://www.rumbledb.org/samples/git-archive-big.json). Start by checking that you get the correct number of records: the dataset should contain 206978 records. You can either use `wget` to download the dataset and read it locally, or read it directly from the URI with `json-file`.\n", |
| 49 | + "\n", |
| 50 | + "We recommend running the cell below to download the data (reading it directly from the URL is slow and hard to debug using the notebook interface)." |
| 51 | + ] |
| 52 | + }, |
| 53 | + { |
| 54 | + "cell_type": "code", |
| 55 | + "execution_count": null, |
| 56 | + "metadata": {}, |
| 57 | + "outputs": [], |
| 58 | + "source": [ |
| 59 | + "!wget https://www.rumbledb.org/samples/git-archive-big.json" |
| 60 | + ] |
| 61 | + }, |
| 62 | + { |
| 63 | + "cell_type": "code", |
| 64 | + "execution_count": null, |
| 65 | + "metadata": {}, |
| 66 | + "outputs": [], |
| 67 | + "source": [ |
| 68 | + "%%jsoniq\n", |
| 69 | + "count(for $i in json-file(\"git-archive-big.json\")\n", |
| 70 | + "return $i)" |
| 71 | + ] |
| 72 | + }, |
| 73 | + { |
| 74 | + "cell_type": "code", |
| 75 | + "execution_count": null, |
| 76 | + "metadata": {}, |
| 77 | + "outputs": [], |
| 78 | + "source": [ |
| 79 | + "# json-file(\"https://www.rumbledb.org/samples/git-archive-big.json\") # to read it from the URL" |
| 80 | + ] |
| 81 | + }, |
| 82 | + { |
| 83 | + "cell_type": "code", |
| 84 | + "execution_count": null, |
| 85 | + "metadata": {}, |
| 86 | + "outputs": [], |
| 87 | + "source": [ |
| 88 | + "%%jsoniq\n", |
| 89 | + "distinct-values(json-file(\"git-archive-big.json\").type)" |
| 90 | + ] |
| 91 | + }, |
| 92 | + { |
| 93 | + "cell_type": "markdown", |
| 94 | + "metadata": {}, |
| 95 | + "source": [ |
| 96 | + "## Question 1: Give the login names of the two actors that committed to master the most in PushEvent events.\n", |
| 97 | + "\n", |
| 98 | + "Write the two names, separated by a comma with no space in between them.\n", |
| 99 | + "\n", |
| 100 | + "Hint: All commits in a push event are stored in an array; count each of them as a separate commit." |
| 101 | + ] |
| 102 | + }, |
| 103 | + { |
| 104 | + "cell_type": "code", |
| 105 | + "execution_count": null, |
| 106 | + "metadata": {}, |
| 107 | + "outputs": [], |
| 108 | + "source": [ |
| 109 | + "%%jsoniq\n", |
| 110 | + "for $i in json-file(\"git-archive-big.json\")\n", |
| 111 | + "where $i.type eq \"PushEvent\" and $i.payload.ref eq \"refs/heads/master\"\n", |
| 112 | + "group by $ac := $i.actor.login\n", |
| 113 | + "let $cnt := count($i.payload.commits[])\n", |
| 114 | + "order by $cnt descending\n", |
| 115 | + "count $c\n", |
| 116 | + "where $c le 3\n", |
| 117 | + "return [$cnt, $ac]" |
| 118 | + ] |
| 119 | + }, |
| 120 | + { |
| 121 | + "cell_type": "markdown", |
| 122 | + "metadata": {}, |
| 123 | + "source": [ |
| 124 | + "## Question 2: For how many repos do we have both a creation and deletion event in the data?\n", |
| 125 | + "\n", |
| 126 | + "Write the number and nothing else." |
| 127 | + ] |
| 128 | + }, |
| 129 | + { |
| 130 | + "cell_type": "code", |
| 131 | + "execution_count": null, |
| 132 | + "metadata": {}, |
| 133 | + "outputs": [], |
| 134 | + "source": [ |
| 135 | + "%%jsoniq\n", |
| 136 | + "count(for $i in json-file(\"git-archive-big.json\")\n", |
| 137 | + "where $i.type eq \"CreateEvent\" or $i.type eq \"DeleteEvent\"\n", |
| 138 | + "group by $repo := $i.repo.id\n", |
| 139 | + "where count(distinct-values($i.type)) eq 2\n", |
| 140 | + "return $repo)" |
| 141 | + ] |
| 142 | + }, |
| 143 | + { |
| 144 | + "cell_type": "markdown", |
| 145 | + "metadata": {}, |
| 146 | + "source": [ |
| 147 | + "## Question 3: Which repository has the highest number of people pushing to it?\n", |
| 148 | + "\n", |
| 149 | + "Give both the repository id and the number of people, separated by a comma with spaces in between.\n", |
| 150 | + "\n", |
| 151 | + "Hint: Differentiate users (_actors_) using their actor id." |
| 152 | + ] |
| 153 | + }, |
| 154 | + { |
| 155 | + "cell_type": "code", |
| 156 | + "execution_count": null, |
| 157 | + "metadata": {}, |
| 158 | + "outputs": [], |
| 159 | + "source": [ |
| 160 | + "%%jsoniq\n", |
| 161 | + "for $i in json-file(\"git-archive-big.json\")\n", |
| 162 | + "where $i.type eq \"PushEvent\"\n", |
| 163 | + "group by $repo := $i.repo.id\n", |
| 164 | + "let $people := count(distinct-values($i.actor.id))\n", |
| 165 | + "order by $people descending\n", |
| 166 | + "count $c\n", |
| 167 | + "where $c le 10\n", |
| 168 | + "return [$repo, $people]" |
| 169 | + ] |
| 170 | + } |
| 171 | + ], |
| 172 | + "metadata": { |
| 173 | + "anaconda-cloud": {}, |
| 174 | + "kernelspec": { |
| 175 | + "display_name": "Python 3 (ipykernel)", |
| 176 | + "language": "python", |
| 177 | + "name": "python3" |
| 178 | + }, |
| 179 | + "language_info": { |
| 180 | + "codemirror_mode": { |
| 181 | + "name": "ipython", |
| 182 | + "version": 3 |
| 183 | + }, |
| 184 | + "file_extension": ".py", |
| 185 | + "mimetype": "text/x-python", |
| 186 | + "name": "python", |
| 187 | + "nbconvert_exporter": "python", |
| 188 | + "pygments_lexer": "ipython3", |
| 189 | + "version": "3.11.9" |
| 190 | + } |
| 191 | + }, |
| 192 | + "nbformat": 4, |
| 193 | + "nbformat_minor": 4 |
| 194 | +} |