Consider using github-to-sqlite to grab our activity dataset #76

Open
@choldgraf

Context

This tool is basically a two-step process.

  • In step 1, we grab as much data as we can about issues/PRs/comments/etc using the GitHub API. This is done with:
    • This function:
      def get_activity(
          target, since, until=None, repo=None, kind=None, auth=None, cache=None
      ):
          """Return issues/PRs within a date window.

          Parameters
          ----------
          target : string
              The GitHub organization/repo for which you want to grab recent issues/PRs.
              Can either be *just* an organization (e.g., `jupyter`) or a combination
              organization and repo (e.g., `jupyter/notebook`). If the former, all
              repositories for that org will be used. If the latter, only the specified
              repository will be used.
          since : string | None
              Return issues/PRs with activity since this date or git reference. Can be
              any string that is parsed with dateutil.parser.parse.
          until : string | None
              Return issues/PRs with activity until this date or git reference. Can be
              any string that is parsed with dateutil.parser.parse. If None, today's
              date will be used.
          kind : ["issue", "pr"] | None
              Return only issues or PRs. If None, both will be returned.
          auth : string | None
              An authentication token for GitHub. If None, then the environment
              variable `GITHUB_ACCESS_TOKEN` will be tried. If it does not exist,
              then attempt to infer a token from `gh auth status -t`.
          cache : bool | str | None
              Whether to cache the returned results. If None, no caching is
              performed. If True, the cache is located at
              ~/github_activity_data. It is organized as orgname/reponame folders
              with CSV files inside that contain the latest data. If a string, it
              is treated as the path to a cache folder.
    • And this GraphQL query: https://github.com/executablebooks/github-activity/blob/master/github_activity/graphql.py
  • In step 2, we parse the resulting data, munge it, and output markdown, statistics, etc.

However, the functionality in step 1 is kind of hacky and messy, and hard to reason about.

I recently came across a tool recommended by @simonw, github-to-sqlite, which essentially replicates all of this functionality but with a better-structured and better-maintained implementation.

This is a Python library that will grab all of the issues, pull requests, and comments (among other things) from a repository and store them in a local SQLite database so that you can do what you want with them. The database is structured to work with Datasette as well (though we may not have use for that in this package, just FYI).
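To make this concrete, here is a minimal sketch of how this package could filter such a database by date using Python's built-in `sqlite3` module. The `issues` table and its `number`, `title`, and `updated_at` columns are a simplified assumption for illustration, not the actual github-to-sqlite schema, and `issues_in_window` is a hypothetical helper.

```python
import sqlite3


def issues_in_window(conn, since, until):
    """Return (number, title) rows for issues updated in [since, until).

    Hypothetical helper: assumes an `issues` table with ISO 8601
    timestamps stored as text in an `updated_at` column.
    """
    # ISO 8601 strings in one consistent format sort lexicographically,
    # so plain string comparison is enough for the date window.
    rows = conn.execute(
        "SELECT number, title FROM issues "
        "WHERE updated_at >= ? AND updated_at < ? "
        "ORDER BY updated_at",
        (since, until),
    )
    return rows.fetchall()


# In-memory stand-in for a database file produced by another tool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (number INTEGER, title TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?)",
    [
        (1, "Old issue", "2020-12-15T10:00:00Z"),
        (2, "Recent issue", "2021-01-10T09:30:00Z"),
        (3, "Newer issue", "2021-02-02T12:00:00Z"),
    ],
)

recent = issues_in_window(conn, "2021-01-01T00:00:00Z", "2021-02-01T00:00:00Z")
print(recent)
```

If something like this works against the real schema, date filtering could live entirely on our side, regardless of whether the download step supports it.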

Two questions that I have and am not sure of the answers to:

  • How to speed it up. I'm not sure whether github-to-sqlite does any caching or allows you to filter by date. If not, then it might take quite a long time to run this interactively.
  • How to run via a Python API. All the examples use a CLI, and while this is probably fine, it would be nice if we could grab / update datasets by running this as part of other scripts.
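On the second question, one fallback if no Python API is available would be to wrap the CLI with `subprocess`. The sketch below assumes the `github-to-sqlite` command is installed and that the `issues`, `pull-requests`, and `issue-comments` subcommands take a database path followed by an `owner/repo` argument, as in the project's README. `build_command` and `fetch` are hypothetical helpers, not part of either package.

```python
import shutil
import subprocess

# Subcommands assumed from the github-to-sqlite README.
ALLOWED_KINDS = {"issues", "pull-requests", "issue-comments"}


def build_command(db_path, repo, kind="issues"):
    """Build the argument list for one github-to-sqlite invocation."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"Unsupported kind: {kind!r}")
    return ["github-to-sqlite", kind, db_path, repo]


def fetch(db_path, repo, kind="issues"):
    """Run the CLI, raising if it is missing or exits with an error."""
    cmd = build_command(db_path, repo, kind)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("github-to-sqlite is not installed or not on PATH")
    subprocess.run(cmd, check=True)


# Only build the command here; calling fetch() would hit the GitHub API.
print(build_command("github.db", "jupyter/notebook", kind="pull-requests"))
```

This keeps the dependency at arm's length, at the cost of losing structured errors and return values that a real Python API would give us.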

Proposal

What do folks think about re-using github-to-sqlite for our "grab all of the activity in a repository" step, and focusing this repository on the munging / filtering by date / calculating statistics / generating markdown aspects?

I think this might be a nice way to reduce some unnecessary complexity here and to re-use code from others in the ecosystem. I also like the idea of becoming familiar with Datasette structures, as it opens the possibility that we could expose this kind of data in the future for others in the community to munge and use.

At this point I'm just exploring the idea and curious what others think!
